[Math][Linear Algebra] Low-rank Matrix Decomposition and LoRA

### [Deeplearning.ai GenAI/LLM course notes](https://learn.deeplearning.ai/)

---

## Low-rank Matrix Decomposition and LoRA

### Low-rank Matrix Decomposition

A quick review of low-rank approximation (rank factorization of a matrix) from linear algebra.

<div style="text-align: center;">
<figure>
<img src="https://dustinstansbury.github.io/theclevermachine/assets/images/svd-data-compression/low-rank-approximation.png" alt="low-rank-approximation.png" width="300">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://arxiv.org/pdf/2106.09685.pdf" target="_blank">Low-rank Matrix Decomposition</a>
</span>
</figcaption>
</figure>
</div>

- A matrix $M$ of size $m \times n$ and rank $r$ can be decomposed into a pair of matrices $L_k$ (of size $m \times k$) and $R_k$ (of size $k \times n$).
- Exact reconstruction:
    - When $k = r$, the matrix $M$ can be exactly reconstructed from the decomposition: the product $L_k R_k$ equals the original matrix $M$.
- Low-rank approximation:
    - When $k < r$, the decomposition provides a low-rank approximation $\hat{M}$ of $M$: the product $L_k R_k$ is a matrix $\hat{M}$ that approximates $M$ but has rank at most $k$.
- Rank ($r$):
    - The rank $r$ of a matrix is the maximum number of linearly independent columns (equivalently, rows).

    :::info
    In other words, the rank describes **the dimensionality of the information (data) in the matrix**. Low-rank decomposition can therefore also be viewed as a form of dimensionality reduction or compression.
    :::

    - A matrix of rank $r$ has $r$ linearly independent columns (or rows), so every remaining column (or row) can be written as a linear combination of those $r$.

#### Low-rank Matrix Decomposition and LoRA for Large Language Models

- The core idea of LoRA is to exploit low-rank structure when fine-tuning a model, which parallels the goal of low-rank matrix decomposition: **capturing and compressing information with fewer parameters**.
- In LoRA, the pretrained weights $W_0$ stay frozen and only the weight update is trained. That update is parameterized as a low-rank product $\Delta W = BA$ of two small matrices, so the low-rank structure is imposed on the update $\Delta W$ rather than on the original weights.

#### [2020.08. The Clever Machine. SVD and Data Compression Using Low-rank Matrix Approximation](https://dustinstansbury.github.io/theclevermachine/svd-data-compression)

<div style="text-align: center;">
<figure>
<img src="https://dustinstansbury.github.io/theclevermachine/assets/images/svd-data-compression/image-singular-values.png" alt="image-singular-values.png" width="800">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://arxiv.org/pdf/2106.09685.pdf" target="_blank">Singular Value Decomposition of an image X.</a>
</span>
</figcaption>
</figure>
</div>

> - Left panel: a grayscale image can be treated as a matrix X.
>     - Note: the rank of X is $r = 120$.
> - Middle panel: the singular values (blue) and their logarithm (red) as a function of the rank index $k$. The singular values decay roughly exponentially with $k$; the leading singular values are much larger than the later ones.
> - Right panel: the cumulative share of the information about X encoded by the singular values up to $k$. Most of the information is carried by the first few singular vectors returned by the SVD.
>     - y-axis: cumulative information retained (higher means closer to the original image/matrix; lower means a higher compression ratio).
>     - x-axis: as $k$ approaches $r = 120$, the compression ratio drops and the rank-$k$ approximation (the compressed image) gets closer to the original image/matrix X.

## Supplementary material

### PEFT: Parameter-Efficient Finetuning

#### [Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning](https://arxiv.org/abs/2303.15647)

#### [A Guide to Parameter-Efficient Fine-Tuning](https://zhuanlan.zhihu.com/p/627537421) (explanation in Simplified Chinese)

#### [Machine Learning 2023 (Generative AI). Hung-yi Lee. Finetuning vs. Prompting](https://www.youtube.com/watch?v=F58vJcGgjt0)

#### [2022. Cheng-Han Chiang, Yung-Sung Chuang, Hung-yi Lee. AACL-IJCNLP. Recent Advances in Pre-trained Language Models: Why Do They Work and How to Use Them](https://d223302.github.io/AACL2022-Pretrain-Language-Model-Tutorial/lecture_material/AACL_2022_tutorial_PLMs.pdf)

#### [Finetuning Large Language Models](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)

- [Considerations on getting started now](https://hackmd.io/@YungHuiHsu/r1KGob8fT)
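
### Sketch: rank factorization and a LoRA-style update

To make the low-rank decomposition and LoRA notes above concrete, here is a minimal standard-library sketch. It is not taken from the course or the LoRA paper. The helper names (`matmul`, `matrix_rank`, `full_param_count`, `low_rank_param_count`), the matrix values, and the 1024 x 1024 layer size are all made up for illustration. It covers only the exact case $k = r$ and a rank-1 update. Computing the best approximation for $k < r$ would need an SVD routine, which is left out here.

```python
from fractions import Fraction


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matrix_rank(m):
    """Rank via Gaussian elimination, using exact Fraction arithmetic."""
    rows = [[Fraction(v) for v in row] for row in m]
    rank = 0
    for col in range(len(rows[0])):
        # Find a row at or below the current pivot position with a non-zero entry.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def full_param_count(m, n):
    """Number of entries in a dense m x n matrix."""
    return m * n


def low_rank_param_count(m, n, k):
    """Number of entries in an (m x k) factor plus a (k x n) factor."""
    return k * (m + n)


# Part 1: exact reconstruction, M = L_k R_k with k = r = 2 (illustrative values).
L_k = [[1, 0], [0, 1], [1, 1], [2, -1]]  # m x k = 4 x 2
R_k = [[1, 2, 3], [0, 1, 1]]             # k x n = 2 x 3
M = matmul(L_k, R_k)                     # m x n = 4 x 3
rank_M = matrix_rank(M)                  # rows 3 and 4 are combinations of rows 1 and 2

# Part 2: LoRA-style update, W = W_0 + B A, with W_0 frozen (illustrative values).
W_0 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # pretrained weights, not trained
B = [[1], [2], [0]]                      # 3 x 1 trainable factor
A = [[0, 1, -1]]                         # 1 x 3 trainable factor
delta_W = matmul(B, A)                   # the update has rank at most 1
W = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W_0, delta_W)]

# Parameter counts for a hypothetical 1024 x 1024 layer with update rank 8.
full = full_param_count(1024, 1024)
lora = low_rank_param_count(1024, 1024, 8)

if __name__ == "__main__":
    print("rank(M) =", rank_M)
    print("rank(delta_W) =", matrix_rank(delta_W))
    print("dense params:", full, "low-rank params:", lora)
```

For the tiny 4 x 3 example the two factors actually hold more numbers than $M$ itself (14 versus 12). The saving only appears when $k$ is much smaller than $m$ and $n$, as in the hypothetical 1024 x 1024 case.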