<style>
.red { color: red; }
.blue { color: blue; }
.green { color: green; }
</style>

# [Sparsely Shared LoRA on Whisper for Child Speech Recognition](https://arxiv.org/abs/2309.11756)

## 1. Introduction
- Automatic speech recognition (ASR) has become very powerful nowadays, benefiting from large amounts of training data.
- Recognition of low-resource speech, such as child speech, disordered speech, and speech from endangered languages, has persistently presented formidable challenges.
- The common practice for adaptation is to collect in-domain data and conduct model fine-tuning, but:
    1. The target-domain data for low-resource speech is very limited.
    2. With larger models, full fine-tuning becomes increasingly computationally costly, and the fine-tuned model tends to overfit.
- Recently, parameter-efficient fine-tuning (PEFT) has attracted great attention in the NLP community, mainly for large language models (LLMs).
- PEFT approaches can be roughly divided into three categories:
    1. <span class='red'>Function composition</span>: e.g., adapters
    2. <span class='red'>Input composition</span>: e.g., prompt tuning and prefix tuning
    3. <span class='red'>Parameter composition</span>: e.g., low-rank adaptation (LoRA), BitFit
- These approaches introduce a small proportion of learnable parameters into the original model in different forms.
- **Adapters** directly insert lightweight neural network modules into the pre-trained model, which changes the model architecture.
- **Prompt tuning** and **prefix tuning** add leading learnable embeddings to the input sequence, resulting in increased inference cost due to longer input sequences.
- **Parameter-composition** PEFT carries no additional inference cost once training has finished, because the trainable parameters can be seamlessly merged into the original weights (see the sketch at the end of this section).
:::success
This property has promoted **parameter composition** into a prevailing fine-tuning strategy.
:::
- It is known that LoRA manually specifies a fixed rank for all the weight matrices, overlooking the varying importance of different weights.
:::info
This passage explains a shortcoming of LoRA (Low-Rank Adaptation) and introduces an improvement, AdaLoRA, which allocates ranks dynamically according to weight importance:
1. **LoRA is a parameter-efficient fine-tuning method**: it adapts large pre-trained models to new tasks without requiring large amounts of training data, which is especially useful in low-resource settings.
2. **In LoRA, weight updates are realized through low-rank decomposed matrices**, which greatly reduces the number of trainable parameters.
3. **LoRA assigns the same fixed rank to all weight matrices**, regardless of how important each weight is.
4. **Different weight matrices contribute differently to model performance**, so a reasonable method should allocate rank resources according to weight importance.
5. **Manually fixing the rank therefore ignores the varying importance among weights**, which can lead to an unreasonable allocation of resources and hurt adaptation performance.

In short, LoRA's fixed-rank allocation is a limitation, and AdaLoRA addresses it by distributing ranks dynamically based on weight importance.
:::
- AdaLoRA addresses the issue with an importance-aware rank allocator. It maintains an overall rank budget and dynamically distributes the ranks to weight matrices according to an importance metric.
:::info
AdaLoRA solves LoRA's fixed-rank problem with an importance-aware rank allocator. It sets an overall rank budget and dynamically distributes ranks to the weight matrices according to an importance metric:
1. AdaLoRA is an improved version of LoRA that aims to give each weight matrix an appropriate amount of rank resources.
2. It introduces an importance-aware rank allocator that distributes rank according to how important each weight matrix is.
3. The allocator first sets an overall rank budget, i.e., the maximum total rank that can be assigned across the whole model.
4. It then uses an importance metric to measure how much each weight matrix contributes to model performance.
5. Matrices with higher importance are dynamically given more rank from the overall budget, while less important ones receive less.
6. Through this dynamic, importance-based allocation, AdaLoRA uses a limited rank budget more effectively, letting important weights obtain more resources and improving adaptation performance.
:::
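Before moving to the model details, a minimal PyTorch sketch (not from the paper; dimensions, rank `r`, and variable names are illustrative) of why parameter-composition PEFT adds no inference cost: a trained low-rank update can be folded back into the frozen weight, leaving a single matrix multiply at inference.

```python
# Illustrative sketch: merging a low-rank update into the frozen weight.
import torch

d_out, d_in, r = 512, 512, 8

W = torch.randn(d_out, d_in)          # frozen pre-trained weight
B = torch.randn(d_out, r) * 0.01      # stands in for a trained low-rank factor
A = torch.randn(r, d_in) * 0.01

# During adaptation the forward pass uses W + B @ A.
delta_W = B @ A

# After training, merge the update once; inference then costs exactly
# the same as the original model.
W_merged = W + delta_W

x = torch.randn(1, d_in)
assert torch.allclose(x @ (W + delta_W).T, x @ W_merged.T)
```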
## 2. S2-LoRA on Whisper Model

### 2.1 Whisper
- Whisper has a typical transformer-based encoder-decoder architecture that is designed for multiple speech tasks, including multilingual ASR, multilingual speech translation, language identification, etc.
- The **transformer structure** consists of three submodules as building blocks, namely the self-attention module (SAM), feed-forward module (FFM), and cross-attention module (CAM). <span class='red'>The encoder layer comprises SAM and FFM, while the decoder layer includes all three.</span>
- The learnable weight matrices in both **SAM** and **CAM** are {W$_q$, W$_k$, W$_v$, W$_o$}, which represent the query, key, value, and output projection matrices, respectively. In **FFM**, {W$_{fc1}$, W$_{fc2}$} are the projection matrices of the fully-connected layers.
- Whisper is <span class='red'>trained on around 680,000 hours of **weakly-supervised** speech data</span> collected from the Internet, with its performance approaching human-level accuracy and robustness.
:::info
"Weakly supervised" means the labels used for training are relatively incomplete or noisy. Compared with standard supervised learning, the training data may lack accurate transcriptions or be only partially annotated, so the model has to learn to extract useful information from such data.
:::

### 2.2 LoRA
- LoRA was first proposed in NLP to efficiently adapt large language models (LLMs) to specific domains or downstream tasks.
- It was found that the weights of pre-trained LLMs tend to reside in a low intrinsic dimensional space.
:::info
This statement says that the weights of pre-trained large language models tend to lie in a low-dimensional intrinsic space:
1. LLMs typically contain a huge number of parameters, up to tens or hundreds of billions.
2. Despite this, the parameters effectively occupy a low-dimensional "intrinsic space": a subspace whose dimensionality is much lower than the total parameter count.
3. In other words, although the model appears to have very many degrees of freedom, most parameters are strongly correlated and the space they actually reach is limited.
4. This may stem from intrinsic properties of the training data, prior assumptions in the model architecture, or properties of the training objective.
5. The observation matters because it implies that a much lower-dimensional parameter space is sufficient to represent and fine-tune these large models, greatly reducing the compute required.
6. Parameter-efficient fine-tuning methods such as LoRA and AdaLoRA build on this finding by adjusting pre-trained models within a low-dimensional parameter space.
:::
- Inspired by this observation, LoRA freezes the original weights and only updates the low-rank incremental weight matrices.
- By applying LoRA, the forward pass is modified as: **f$_i$(x) = x(W$_i$ + ∆W$_i$)$^T$ + b$_i$; ∆W$_i$ = B$_i$A$_i$**, where B$_i$ ∈ R$^{d_2×r}$ and A$_i$ ∈ R$^{r×d_1}$ are the two trainable rank-decomposed matrices, with rank r ≪ min{d$_1$, d$_2$} (see the sketch below).
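A minimal PyTorch sketch of the LoRA forward pass described above (the class name `LoRALinear` and the dimensions are ours, not from the paper): W$_i$ and b$_i$ stay frozen, and only the rank-decomposed factors B$_i$ and A$_i$ are trained.

```python
# LoRA linear layer sketch: f_i(x) = x (W_i + B_i A_i)^T + b_i, with W_i, b_i frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze W_i and b_i
            p.requires_grad = False
        d2, d1 = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, d1) * 0.01)  # A_i: small random init
        self.B = nn.Parameter(torch.zeros(d2, r))         # B_i = 0 so that ΔW_i = 0 at start

    def forward(self, x):
        delta_W = self.B @ self.A             # ΔW_i = B_i A_i
        return self.base(x) + x @ delta_W.T   # x (W_i + ΔW_i)^T + b_i

layer = LoRALinear(nn.Linear(384, 384), r=8)
y = layer(torch.randn(4, 384))
```

Because B$_i$ starts at zero, the adapted model begins exactly at the pre-trained one and only gradually departs from it during adaptation.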
### 2.3 AdaLoRA
- The importance of weight parameters to the performance varies across different layers and modules. Intuitively, some weight matrices should be allocated higher ranks than others in adaptation.
:::info
This means that weight parameters in different layers and modules matter differently for model performance, so during adaptation some weight matrices should receive higher ranks than others:
1. A large neural network consists of many layers and modules, each containing many weight matrices.
2. These layers and modules play different roles, so their weights affect final performance to different degrees.
3. Some layers/modules handle critical feature extraction or information integration, and their weights are crucial to performance; others carry out auxiliary computations and matter less.
4. Therefore, when adapting a large model (e.g., for domain adaptation), the amount of updatable resource (the rank) should be allocated according to how important each weight is.
5. Important weight matrices should receive higher ranks and thus greater update capacity, while less important ones can be given lower ranks to save the budget.
6. Such importance-aware rank allocation maximizes the effect of adaptation under a limited budget, which motivates methods like AdaLoRA.
:::
- However, **LoRA** specifies a fixed rank for all the weight matrices, which overlooks the varying importance of weights and can be sub-optimal.
- **AdaLoRA** addresses this issue with an importance-aware rank allocation method. It comes with two modifications to LoRA.
1. First, the incremental update ∆W$_i$ is parameterized in the form of singular value decomposition (SVD), i.e., **∆W$_i$ = $\overline{B}$$_i$ Λ$_i$ $\overline{A}$$_i$; $\overline{B}$$_i^T$$\overline{B}$$_i$ = $\overline{A}$$_i$$\overline{A}$$_i^T$ = I.**
:::info
This formula describes how AdaLoRA parameterizes the incremental update ΔW$_i$ of the i-th weight matrix W$_i$:
1. ΔW$_i$ is the incremental update to the original weight matrix W$_i$.
2. $\overline{B}$$_i$ ∈ R$^{d_2×r}$ and $\overline{A}$$_i$ ∈ R$^{r×d_1}$ are the left and right singular-vector matrices of ΔW$_i$, with rank r ≪ min{d$_1$, d$_2$}.
3. Λ$_i$ ∈ R$^{r×r}$ is a diagonal matrix whose diagonal elements are the singular values of ΔW$_i$.
4. In this SVD form, ΔW$_i$ is the product of three matrices: ∆W$_i$ = $\overline{B}$$_i$ Λ$_i$ $\overline{A}$$_i$.
5. The second part, $\overline{B}$$_i^T$$\overline{B}$$_i$ = $\overline{A}$$_i$$\overline{A}$$_i^T$ = I, requires $\overline{B}$$_i$ and $\overline{A}$$_i$ to be orthonormal.
6. This parameterization has two benefits: (a) the orthogonality constraint improves numerical stability; (b) controlling the singular values through the diagonal matrix Λ$_i$ makes low-rank approximation and rank allocation straightforward.
7. In AdaLoRA, only $\overline{B}$$_i$, $\overline{A}$$_i$, and the diagonal of Λ$_i$ are trainable; learning them yields ΔW$_i$ for adapting to the target task.
8. Compared with LoRA, which directly learns two low-rank factors, AdaLoRA's parameterization gives explicit control over the singular values and lays the foundation for the importance-aware rank allocation described next.
:::
2. Second, in adaptation, AdaLoRA dynamically allocates an overall rank budget to its update matrices {∆W$_i$}. This is achieved by iteratively masking out less important singular values after every gradient update step.
:::info
This describes how AdaLoRA dynamically distributes the overall rank budget to the update matrices {ΔW$_i$} during adaptation:
1. AdaLoRA sets an overall rank budget, i.e., the maximum total rank available.
2. After every gradient update step, it iteratively masks out the less important singular values in the ΔW$_i$'s (sets them to zero), which is equivalent to reducing the effective rank of ΔW$_i$.
3. This dynamically adjusts how much rank each ΔW$_i$ receives, so that important update matrices obtain a larger share of the budget.
4. The importance of each singular value is judged by an importance metric, e.g., the sensitivity-based metric described below.
5. Concretely, after each update AdaLoRA ranks the singular values of all ΔW$_i$ by importance and, given the preset overall budget, keeps only the most important ones while masking the rest to zero.
6. Compared with LoRA's fixed, uniform rank, this iterative masking adaptively channels the limited rank budget to the positions where it matters most, improving fine-tuning efficiency and performance.
:::
3. A **sensitivity-based importance metric** is utilized to measure and sort the importance of the k-th triplet {Λ$_i^{k,k}$, $\overline{B}$$_i^{*,k}$, $\overline{A}$$_i^{k,*}$} of the i-th weight matrix W$_i$, which takes account of both singular values and vectors. The non-zero Λ$_i^{k,k}$ acts as the rank coefficient that controls the allocated rank budget.
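The following sketch illustrates AdaLoRA's two modifications for a single weight matrix, under simplifying assumptions: |Λ| is used as a stand-in for the sensitivity-based importance metric, the orthogonality of $\overline{B}$ and $\overline{A}$ is encouraged by a regularizer rather than enforced exactly, and the helper names are ours.

```python
# Sketch of AdaLoRA's SVD-style parameterization and iterative rank masking.
import torch
import torch.nn as nn

d1 = d2 = 384
r_init, r_target = 12, 8                               # initial and target rank per matrix

B_bar = nn.Parameter(torch.randn(d2, r_init) * 0.01)   # left singular vectors (trainable)
A_bar = nn.Parameter(torch.randn(r_init, d1) * 0.01)   # right singular vectors (trainable)
lam   = nn.Parameter(torch.zeros(r_init))              # diagonal of Λ, so ΔW = 0 at start

def delta_W() -> torch.Tensor:
    # Modification 1: SVD-style parameterization ΔW = B̄ Λ Ā.
    return B_bar @ torch.diag(lam) @ A_bar

def orth_reg() -> torch.Tensor:
    # Regularizer encouraging B̄ᵀB̄ ≈ I and ĀĀᵀ ≈ I (added to the training loss).
    eye = torch.eye(r_init)
    return ((B_bar.T @ B_bar - eye) ** 2).sum() + ((A_bar @ A_bar.T - eye) ** 2).sum()

def mask_least_important() -> None:
    # Modification 2: after every gradient step, keep only the r_target most
    # important triplets and zero out the rest (|Λ| as a proxy importance score).
    keep = lam.detach().abs().topk(r_target).indices
    mask = torch.zeros_like(lam)
    mask[keep] = 1.0
    with torch.no_grad():
        lam.mul_(mask)

# ... called after every optimizer.step() during adaptation:
mask_least_important()
```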
### 2.4 Proposed S2-LoRA
- Our preliminary experiments on AdaLoRA show that:
    1. It tends to allocate significantly less rank budget to {W$_q$, W$_k$} than to {W$_v$, W$_o$} in both SAM and CAM.
    2. The rank budgets allocated among layers are sparse.
:::info
These preliminary experiments reveal two findings about AdaLoRA's rank allocation behavior:
1. AdaLoRA tends to allocate less rank budget to the query (W$_q$) and key (W$_k$) matrices in both SAM and CAM, and more to the value (W$_v$) and output (W$_o$) matrices.
    - In the transformer, SAM and CAM compute query (Q), key (K), and value (V) representations of the input through W$_q$, W$_k$, and W$_v$.
    - The experiments suggest AdaLoRA regards W$_v$ and W$_o$ as more influential on performance and therefore gives them more updatable rank, while W$_q$ and W$_k$ receive comparatively little.
2. The rank budgets allocated across layers are sparse.
    - AdaLoRA does not spread the limited budget evenly over all layers; some layers receive considerably more rank than others, forming a clearly sparse pattern.

These findings show that AdaLoRA adapts its allocation to the importance of different modules and layers rather than distributing rank uniformly, which motivates the sparsely shared design of S2-LoRA below.
:::
- In this regard, the parameters for adaptation can be further reduced. It may not be necessary to learn the rank-decomposed matrices B$_i$, A$_i$ separately for each weight matrix W$_i$; they can be tied and shared across layers and modules.
:::info
This proposes a way to further reduce the number of trainable parameters: instead of learning separate low-rank matrices B$_i$ and A$_i$ for every weight matrix W$_i$, tie and share them across layers and modules:
1. In LoRA and AdaLoRA, every weight matrix W$_i$ has its own trainable low-rank factors B$_i$ and A$_i$ for building the incremental update ΔW$_i$.
2. As the model grows, the number of independently learned B$_i$ and A$_i$ matrices multiplies, so the total trainable parameter count remains considerable.
3. The authors argue that separate B$_i$ and A$_i$ per W$_i$ may be unnecessary; they can be tied and shared across layers and modules.
4. This is motivated by the observation that AdaLoRA's rank allocation shows similar, regular patterns across layers and modules.
5. If B$_i$ and A$_i$ are shared, each W$_i$ only needs a small rank-coefficient vector to adjust how much of the shared low-rank decomposition it uses.
6. Tying and sharing in this way drastically reduces the number of trainable parameters; for example, if all weights within a module share one pair of B and A, the module keeps only that single pair instead of one pair per weight.
7. Of course, excessive sharing may reduce modeling capacity, so the trade-off between parameter count and performance has to be balanced.

This reasoning provides the design inspiration for S2-LoRA (Sparsely Shared LoRA).
:::
- S2-LoRA has B and A globally shared. The i-th weight matrix only stores a single trainable rank coefficient vector s$_i$. In addition, S2-LoRA introduces an L1 constraint on the rank vector for sparsity.
:::info
This describes the core design of the proposed S2-LoRA (Sparsely Shared LoRA):
1. The low-rank matrices B and A are shared rather than learned separately for every weight matrix W$_i$.
2. Each weight matrix W$_i$ stores only a single trainable rank-coefficient vector s$_i$, without its own B and A.
3. Each element of s$_i$ controls the strength of the low-rank projection of W$_i$ at the corresponding rank, and thereby the magnitude of the update ΔW$_i$ applied to W$_i$.
4. Unlike LoRA and AdaLoRA, S2-LoRA imposes an L1-norm constraint on s$_i$ (restricting its value within a small constant ε), which pushes most elements of s$_i$ toward zero and makes the rank-allocation vector sparse.
5. A sparse s$_i$ means that only a few important ranks contribute an effective projection for W$_i$; this mirrors the clearly sparse rank allocation observed in AdaLoRA.
6. By sharing B and A and sparsifying s$_i$, S2-LoRA saves an enormous number of parameters while retaining the essential rank projections: compared with LoRA and AdaLoRA, it keeps comparable performance with only about 0.02% trainable parameters.
:::
- The orthogonal constraint proposed in Eq. 2 is discarded.
- The objective function of S2-LoRA can be written as:

![截圖 2024-03-22 23.08.07](https://hackmd.io/_uploads/B1de1msC6.png =60%x)
:::info
The three terms of the objective are:
1. L$_{CE}$(D | B, A, {s$_i$}$^N_{i=1}$)
    - The standard cross-entropy loss.
    - D is the adaptation dataset; {s$_i$}$^N_{i=1}$ are the rank-coefficient vectors of the N adapted weight matrices; B and A are the shared rank-decomposed matrices.
    - This term makes the model's predictions on D match the ground-truth labels.
2. α$_1$ · (1/N) · Σ$^N_{i=1}$ ||s$_i$||$_1$
    - The key addition of S2-LoRA: an L1 regularizer on the rank-coefficient vectors {s$_i$}.
    - ||s$_i$||$_1$ is the sum of absolute values of s$_i$; 1/N averages over the weight matrices; α$_1$ controls the strength of the sparsity penalty.
3. α$_2$ · (1/2r) · Σ$^r_{k=1}$ (||B$_{*,k}$|| + ||A$_{k,*}$||)
    - An L2 regularizer on the shared matrices B and A.
    - ||B$_{*,k}$|| and ||A$_{k,*}$|| are the L2 norms of the k-th column of B and the k-th row of A; 1/2r averages over the rank dimensions; α$_2$ controls the strength.

Overall, L$_{total}$ combines (1) the cross-entropy adaptation loss, (2) an L1 sparsity constraint on {s$_i$}, and (3) an L2 regularizer on {B, A}. This lets S2-LoRA learn sparse rank coefficients and keep the shared matrices bounded, achieving very high parameter efficiency.
:::

![截圖 2024-03-22 23.17.20](https://hackmd.io/_uploads/Hyyv-QjCT.png)
- The {B, A} of S2-LoRA are not globally shared across the whole Whisper model, since the encoder and decoder of Whisper serve different functions.
- The three basic blocks, SAM, CAM, and FFM, play different modeling roles. Consequently, Whisper is split into 5 modules, namely Enc-SAM, Enc-FFM, Dec-SAM, Dec-CAM, and Dec-FFM. {B, A} are tied only within each module, as sketched below.
:::info
This explains how S2-LoRA is instantiated on the Whisper model:
1. Whisper is a typical transformer encoder-decoder, consisting of an encoder and a decoder.
2. They are built from three basic modules: the self-attention module (SAM), the cross-attention module (CAM), and the feed-forward module (FFM).
3. These modules play different modeling roles and contribute differently to final performance.
4. The model is therefore split into 5 module groups: Enc-SAM, Enc-FFM, Dec-SAM, Dec-CAM, and Dec-FFM.
5. The low-rank matrices B and A are shared within each module group rather than globally; each of the five groups maintains its own pair of B and A.
6. The reason is that the encoder and decoder, and the different module types, differ in function; forcing a single pair of B and A across all of them could be overly restrictive.
7. Tying B and A at the module level balances parameter sharing against modeling capacity: all weight matrices inside a module share that module's B and A, and each keeps only a sparse rank-coefficient vector.
8. This tying strategy greatly reduces the number of trainable parameters while still giving each module some modeling freedom of its own.
:::
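A rough sketch of the ideas above for one module group (our own illustration, not the released implementation): a single shared pair {B, A}, one rank-coefficient vector s$_i$ per weight matrix, an L1 penalty on each s$_i$, and an L2 penalty on the columns/rows of B and A as in Eq. 4. The form ∆W$_i$ = B·diag(s$_i$)·A is our reading of the rank-coefficient design.

```python
# S2-LoRA sketch for one module group (e.g., Dec-CAM), under the stated assumptions.
import torch
import torch.nn as nn

d, r, N = 384, 8, 4            # hidden size, shared rank, number of weights in the module

B = nn.Parameter(torch.randn(d, r) * 0.01)     # shared within the module group
A = nn.Parameter(torch.randn(r, d) * 0.01)
s = nn.ParameterList([nn.Parameter(torch.zeros(r)) for _ in range(N)])  # one s_i per W_i

def delta_W(i: int) -> torch.Tensor:
    # ΔW_i = B diag(s_i) A : s_i scales each rank-1 component for this weight matrix.
    return B @ torch.diag(s[i]) @ A

def regularizer(alpha1: float = 0.05, alpha2: float = 0.1) -> torch.Tensor:
    # L1 sparsity on the rank coefficients, L2 on the shared factors (Eq. 4 style).
    l1 = sum(si.abs().sum() for si in s) / N
    l2 = (B.norm(dim=0).sum() + A.norm(dim=1).sum()) / (2 * r)
    return alpha1 * l1 + alpha2 * l2

# total loss = cross-entropy on the adaptation data + regularizer()
```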
## 3. Experiments Setup

### 3.1 Datasets
- The Chinese child speech for Whisper adaptation comes from subsets of the CSRC-2021 dataset.
- CSRC-2021 contains 28.6 hours of child read speech (zh-C1) and 29.5 hours of child conversational speech (zh-C2).
- <span class='red'>The zh-C1 is the training set in the target domain</span>, which has around 30K utterances.
- To investigate the effect of different amounts of adaptation data, zh-C1 was split into three sets, namely zh-C1-1K, zh-C1-10K, and the original zh-C1-30K.
- In the **evaluation stage**, three Chinese Mandarin test sets were prepared: child read/conversational speech from the evaluation sets of CSRC-2021 and **adult read speech from the Aishell-1 test set** (zh-A1). In addition, Cantonese (zh-HK) and English (en) from the Common Voice 11.0 test sets are utilized to evaluate cross-lingual generalization. Each of the above test sets contains 1,000 randomly selected utterances for fast evaluation.

### 3.2 PEFT Configuration
- The PEFT methods compared all belong to the parameter-composition category:
    1. **BitFit**: update all the bias parameters of the model.
    2. **IA3**: introduce three learned vectors per layer that element-wise multiply the projection outputs of {W$_k$, W$_v$, W$_{fc2}$}.
    3. **LoRA**: model the incremental updates of {W$_q$, W$_v$} with low-rank matrices, where rank r = 8.
    4. **GLoRA**: a generalized LoRA that gives a unified formulation encompassing all tunable dimensions. In GLoRA, ∆W$_i$ = W$_i$A$_i$ + B$_i$, ∆b$_i$ = C$_i$W$_i$ + D$_i$b$_i$ + E$_i$, where {A$_i$, B$_i$, C$_i$, D$_i$, E$_i$} are trainable parameters whose matrices can be further low-rank decomposed (r = 8). Note that GLoRA reduces to LoRA if only B exists and to BitFit if only E exists.
    5. **AdaLoRA**: dynamically allocate the rank budget to the weight matrices {W$_q$, W$_v$}, where the initial rank is set to 12 and the target rank to 8.
    6. **S2-LoRA**: the proposed method that sparsely shares the low-rank components (r = 8); every weight matrix only needs to maintain a rank coefficient vector s. In Eq. 4, the sparsity weight α$_1$ and regularization weight α$_2$ are set to 0.05 and 0.1, respectively.

### 3.3 Whisper Tuning
- Whisper has five different sizes:
    1. tiny (39M)
    2. base (74M)
    3. small (244M)
    4. medium (769M)
    5. large (1.5B)
- The <span class='red'>Huggingface trainer is used</span> to fine-tune the model (a configuration sketch follows this section).
- Since the speech input to Whisper is padded to 30 seconds, a batch size of 2 is used with 8 gradient accumulation steps.
- Mixed precision training is enabled.
- The number of epochs is set to 3.
- The learning rate is 1e-3 in PEFT mode and 1e-4 under full fine-tuning.
- Greedy decoding is applied for ASR inference.
- Character error rate (CER) is used to evaluate Chinese languages and word error rate (WER) for English.
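For reference, a hedged sketch of the fine-tuning setup above using the Hugging Face `Seq2SeqTrainingArguments` API; only the hyperparameters stated in this section are filled in, the output path is a placeholder, and the remaining options are left at library defaults.

```python
# Illustrative training configuration (values taken from Section 3.3).
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-child-asr",      # placeholder path
    per_device_train_batch_size=2,       # batch size 2
    gradient_accumulation_steps=8,       # 8 gradient accumulation steps
    learning_rate=1e-3,                  # 1e-4 under full fine-tuning
    num_train_epochs=3,
    fp16=True,                           # mixed precision training
    predict_with_generate=True,          # generate during evaluation (greedy by default)
)
```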
## 4. Results and Analysis

### 4.1 Why do we need PEFT on Whisper?
- To prove the necessity of PEFT for Whisper, a performance comparison between Full-FT and PEFT is carried out.
- AdaLoRA (r = 8) is used as a representative PEFT method in this experiment.

![截圖 2024-03-23 01.32.52](https://hackmd.io/_uploads/Bk-qWBsAa.png =70%x)
:::success
- Whisper with PEFT clearly outperforms Full-FT on the zh-C1 in-domain evaluation and keeps a similar error-reduction trend on zh-C2.
- For Full-FT, only the tiny and base Whisper models show improvements after adaptation, while small and medium suffer large performance degradation, especially in the 1K-utterance case.
- PEFT gives consistent improvements across different model sizes and different amounts of adaptation data.
- In addition, <span class='red'>PEFT on medium (half the size of large) surpasses the smaller model sizes by a large margin, approaching the level of large</span>.
:::

### 4.2 What makes AdaLoRA superior to LoRA?
- AdaLoRA always gives significantly better PEFT performance than LoRA in our experiments.
- AdaLoRA mainly comes with two modifications to the vanilla LoRA:
    1. Orthogonal regularization and learnable rank coefficients via SVD.
    2. Importance-aware rank allocation.
- To investigate the main reason that makes AdaLoRA superior to LoRA, the following ablation experiments are performed:
    1. Remove the orthogonal regularization (orth) from AdaLoRA.
    2. Remove the importance-aware rank allocation (alloc) from AdaLoRA.
    3. Remove both orth and alloc from AdaLoRA, leaving only the learnable rank coefficients.
    4. Add rank coefficients to LoRA, since the rank coefficients used for rank allocation cannot be removed from AdaLoRA.

![截圖 2024-03-23 01.50.09](https://hackmd.io/_uploads/r1vJSBj06.png =70%x)
:::success
- Removing orth, alloc, or both from AdaLoRA does not cause clear performance degradation; in some cases it even brings performance gains.
- This shows that orth and alloc are not the key factors behind AdaLoRA's improvement over LoRA.
:::
- To verify the benefits of learnable rank coefficients, a trainable scalar α was introduced to LoRA, named α-LoRA, where ∆W$_i$ = αB$_i$A$_i$ (a sketch follows this section).
- For vanilla LoRA, B$_i$ has to be initialized to zero to make sure ∆W$_i$ = 0 at the start. With α = 0, {B$_i$, A$_i$} can instead be initialized from a normal distribution, similar to the initialization of AdaLoRA.
:::info
This paragraph describes an experiment designed to verify the contribution of learnable rank coefficients:
1. A trainable scalar α is added to the original LoRA, giving α-LoRA, in which the incremental update is parameterized as ΔW$_i$ = αB$_i$A$_i$.
2. In vanilla LoRA, B$_i$ must be initialized to all zeros so that ΔW$_i$ = 0 at the start of training.
3. In α-LoRA, because of the learnable α, B$_i$ and A$_i$ can be initialized from a normal distribution while ΔW$_i$ ≈ 0 still holds as long as α ≈ 0.
4. This initialization scheme is similar to AdaLoRA, where $\overline{B}$$_i$ and $\overline{A}$$_i$ are also normally initialized.
5. Comparing vanilla LoRA with α-LoRA, α-LoRA performs substantially better and approaches AdaLoRA.
6. This verifies that the learnable rank coefficient (here the scalar α) is what boosts fine-tuning performance; vanilla LoRA lacks the ability to adaptively scale the strength of the low-rank projection.
7. α-LoRA can be seen as a simplified AdaLoRA that drops the SVD parameterization and the rank-allocation strategy but keeps the crucial learnable rank coefficient.
8. The results indicate that the learnable rank coefficients are the main source of AdaLoRA's gain over LoRA, whereas the SVD form and dynamic rank allocation are secondary, which also motivates S2-LoRA's sparse rank-coefficient vectors.
:::
:::success
- α-LoRA largely improves over vanilla LoRA across all ranks, illustrating that the learnable rank coefficients are the most important design.
:::
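A minimal sketch of the α-LoRA ablation referenced above (class name and dimensions are ours): a single trainable scalar α scales the low-rank update, ∆W$_i$ = αB$_i$A$_i$, so B$_i$ and A$_i$ can be normally initialized while ∆W$_i$ still starts at zero.

```python
# α-LoRA sketch: ΔW_i = α B_i A_i with a learnable scalar α initialized to 0.
import torch
import torch.nn as nn

class AlphaLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                   # freeze W_i and b_i
            p.requires_grad = False
        d2, d1 = base.out_features, base.in_features
        self.B = nn.Parameter(torch.randn(d2, r) * 0.01)   # normal init is now safe
        self.A = nn.Parameter(torch.randn(r, d1) * 0.01)
        self.alpha = nn.Parameter(torch.zeros(()))         # learnable rank coefficient

    def forward(self, x):
        return self.base(x) + x @ (self.alpha * self.B @ self.A).T

layer = AlphaLoRALinear(nn.Linear(384, 384))
y = layer(torch.randn(4, 384))
```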
### 4.3 Comparing S2-LoRA with other PEFTs
- The proposed S2-LoRA is <span class='red'>inspired by the design of rank coefficients</span> and the <span class='red'>sparse rank distribution of AdaLoRA</span>.

![截圖 2024-03-23 02.03.27](https://hackmd.io/_uploads/BJM-uHiAa.png =80%x)
:::success
- AdaLoRA with 0.46% trainable parameters achieves the best in-domain performance, while GLoRA with 0.81% exhibits the best generalization ability.
- IA3 with 0.03% trainable parameters performs the worst among the PEFT approaches.
- Compared to AdaLoRA, the proposed S2-LoRA is competitive on in-domain data and performs better under out-of-domain conditions, while using 20x fewer trainable parameters (0.02%). Notably, <span class='red'>S2-LoRA has even fewer trainable parameters than IA3 and BitFit</span>.
:::
:::success
For the cross-lingual evaluation:
- All PEFT methods achieved positive performance gains in Mandarin (zh-C1, zh-C2, and zh-A1), while suffering from increased CERs (around 2-3%) in Cantonese (zh-HK).
- This may be due to a conflict in the token modeling of Whisper, where the same language ID zh is used for both Mandarin and Cantonese.
- Performance improvement in Mandarin would therefore cause a performance reduction in Cantonese.
- With different language IDs (en vs. zh), all PEFT methods are found to bring some improvement in English while adapting to Mandarin.
:::
:::info
This points out that Whisper uses the same language ID "zh" for both Mandarin and Cantonese in its token modeling, which may lead to performance conflicts:
1. In speech recognition and NLP, the input is first converted into a sequence of tokens (tokenization), and the quality of tokenization strongly affects how the model understands and models the data.
2. As a large multilingual speech model, Whisper assigns each language an ID to distinguish the characteristics and modeling of different languages.
3. However, Whisper assigns the two Chinese varieties, Mandarin (zh-cmn) and Cantonese (zh-yue), the same language ID "zh", possibly for data-volume reasons or because of their partial similarity in pronunciation and vocabulary.
4. Mandarin and Cantonese nonetheless differ clearly in phonetics and language characteristics, so tokenizing and modeling them under one language ID can cause conflicts.
5. When Whisper is fine-tuned on Mandarin data, the model focuses more on Mandarin speech patterns and partially neglects Cantonese characteristics.
6. As a result, performance improves on the Mandarin test sets but drops on the Cantonese test set (CER increases by around 2-3%).
7. To resolve this, Mandarin and Cantonese would likely need separate language IDs to support more fine-grained language modeling.
:::

![截圖 2024-03-23 02.14.04](https://hackmd.io/_uploads/H1EF5Bo0p.png =80%x)
:::success
- The rank distribution of S2-LoRA is found to have similar patterns to that of AdaLoRA.
- Each element of the distribution matrix denotes the allocated rank of the incremental update matrix ∆W$_i$.
- Elements with brighter colors represent lower allocated ranks and are thus less important.
- In the SAM/CAM of both the encoder and the decoder, the ranks allocated to {W$_q$, W$_k$} are mostly zero, suggesting that they are much less important than {W$_v$, W$_o$}.
- This phenomenon exists in both AdaLoRA and S2-LoRA, indicating that S2-LoRA learns to identify important blocks and allocate ranks to them to improve adaptation performance.
:::

## 5. Conclusion
- This paper presents a novel S2-LoRA approach to adapting Whisper.
- S2-LoRA improves on AdaLoRA with far fewer trainable parameters than most existing PEFT methods such as GLoRA, BitFit, etc.
- S2-LoRA benefits from the design of sparse learnable rank coefficients and shared rank-decomposed matrices.
- Experiments carried out on low-resource Chinese child speech demonstrate the effectiveness of the proposed approach, showing that S2-LoRA achieves adaptation performance comparable to AdaLoRA and noticeably better cross-domain generalizability while retaining only 0.02% trainable parameters.
- Though child speech is used as the study case, the proposed <span class='red'>S2-LoRA is general and can benefit adaptation in other low-resource speech recognition scenarios</span>.