# LLM Finetuning

## Agenda

1. **Finetuning Background**
    - Why not full finetuning?
    - Why not in-context learning?
    - Why parameter-efficient fine-tuning (PEFT)?
2. **Three subsets of PEFT**
    - Parameter composition
    - Input composition
    - Function composition
3. **LoRA - Low-Rank Adaptation**
4. **QLoRA - Quantized LoRA**
5. **Human Preference Fine-Tuning**
    - Why RLHF?
    - How to produce an RLHF dataset?
    - PPO
    - DPO
6. Simple Demo

## Finetuning Background

### Why not full finetuning?

1. As model size increases, the cost of full fine-tuning keeps growing.
![](https://hackmd.io/_uploads/ByQuvhhTh.png)
2. **Catastrophic interference**
    - Catastrophic interference, also known as catastrophic forgetting, is the tendency of an artificial neural network to abruptly and drastically forget previously learned information upon learning new information.
    - https://en.wikipedia.org/wiki/Catastrophic_interference

### Why not in-context learning?

1. Inefficiency: the prompt needs to be processed every time the model makes a prediction.
2. Poor performance: prompting generally performs worse than fine-tuning.
3. Sensitivity: the way we phrase the prompt and the order of the examples influence model performance.
4. Lack of clarity: what does a truly good prompt format look like?

### Why parameter-efficient fine-tuning?

![](https://hackmd.io/_uploads/rkcl9hnah.png)

1. Fine-tuning all parameters is impractical with large models.
2. SOTA models are massively over-parameterized.
3. **Modularity**
    - Each PEFT component acts as a module; depending on the task, we can plug in different modules.
    ![](https://hackmd.io/_uploads/HyQn63han.png)
    - Models can fit unseen scenarios and be updated through added components.
    <img src="https://hackmd.io/_uploads/BkJgRh2a3.png" alt="drawing" width="300"/>

## Three Subsets of PEFT

### Parameter Composition

#### 1. Sparse Subnetworks

- A common inductive bias on the module parameters is **sparsity**.
- Common sparsity method: pruning.
![](https://hackmd.io/_uploads/r1zPW6362.png)
- During pruning, a fraction of the lowest-magnitude weights are removed (set to 0).
- The non-pruned weights are re-trained.
- **Magnitude pruning:** remove weights whose magnitude falls below a threshold.
- **Movement pruning:** remove weights that are moving toward zero during fine-tuning.
- **Diff pruning:** limit how many weights are adapted during fine-tuning of a pretrained model, and only store the difference of the weights from their pre-trained state.

#### 2. Structured Composition

- Only modify the weights that are associated with a pre-defined group.
- Most common setting: each group G corresponds to a layer; only update the parameters associated with certain layers.
![](https://hackmd.io/_uploads/HJMj7T2ah.png)

#### 3. Low-Rank Composition

- Inductive bias: module parameters should lie in a low-dimensional space.
![](https://hackmd.io/_uploads/SkBoSan62.jpg)
- **Intrinsic dimensionality:** the intrinsic dimension of a dataset can be thought of as the number of variables needed in a minimal representation of the data.
- We will introduce LoRA in the following sections.

### Input Composition

- Augment a model's input with a learnable parameter vector, as in the sketch below.
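As a minimal illustration of input composition, the sketch below prepends a trainable "soft prompt" to the input embeddings of a frozen model; the prompt tuning method described next is a concrete instance of this idea. The wrapper, names, and shapes are illustrative assumptions (in particular, `base_model` is assumed to accept an `inputs_embeds` argument), not a reference implementation.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepends trainable soft-prompt vectors to the input embeddings of a frozen model.

    Assumes `base_model` accepts `inputs_embeds` of shape (batch, seq, d_model).
    Only the soft prompt receives gradients during fine-tuning.
    """

    def __init__(self, base_model: nn.Module, embedding: nn.Embedding,
                 num_virtual_tokens: int = 20):
        super().__init__()
        self.base_model = base_model
        self.embedding = embedding
        d_model = embedding.embedding_dim
        # The only trainable parameters: a small matrix of "virtual token" embeddings.
        self.soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, d_model) * 0.02)
        for module in (self.base_model, self.embedding):
            for p in module.parameters():
                p.requires_grad = False  # keep the pre-trained weights fixed

    def forward(self, input_ids: torch.Tensor):
        token_embeds = self.embedding(input_ids)                      # (B, T, D)
        prompt = self.soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, token_embeds], dim=1)      # (B, P + T, D)
        return self.base_model(inputs_embeds=inputs_embeds)
```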
#### Prompt Tuning

- Add a soft prompt (a continuous vector) at the beginning of each prompt and tune only the parameters of these soft-prompt vectors:
![](https://hackmd.io/_uploads/ByskyC2p3.jpg)
- However, prompt tuning only works well at scale:
![](https://hackmd.io/_uploads/BkFakAhT2.png)

### Function Composition

- Function composition augments a model's functions with new task-specific functions.
- The main purpose of a new function added to a pre-trained model is to adapt it; such added functions are known as adapters.
- The design of adapters is model-specific.
- An adapter is usually placed after the multi-head attention or after the feed-forward layer:
<img src="https://hackmd.io/_uploads/Syyu-Chph.png" alt="drawing" width="300"/>

## LoRA - Low-Rank Adaptation

### LoRA Finetuning Explained

![](https://hackmd.io/_uploads/HynHQChah.png)

- The technique constrains the rank of the update matrix ΔW using its **rank decomposition**. It represents ΔWₙₖ as the product of two low-rank matrices Bₙᵣ and Aᵣₖ, where r << min(n, k). This means the forward pass of the layer, **originally Wx, is modified to Wx + BAx** (a minimal code sketch of this follows the QLoRA section below).
![](https://hackmd.io/_uploads/B1GFNC3T3.png)
- **Advantages of LoRA:**
    - **Reduced training time and memory:** with the decomposition above, only r(n + k) parameters have to be tuned during model adaptation. Since r << min(n, k), this is far fewer than the nk parameters that would otherwise have to be tuned, which reduces the time and space required to fine-tune the model by a large margin.
    - **No additional inference latency:** in production, we can explicitly compute W' = W + BA, store the result, and perform inference as usual. This guarantees that we do not introduce any additional latency during inference.
    - **Easier task switching:** swapping only the LoRA weights, rather than all the parameters, allows cheaper and faster switching between tasks. Multiple customized models can be created and swapped in and out easily.

## QLoRA - Quantized LoRA

- **Why QLoRA?**
    1. While LoRA reduces the number of trainable parameters, you still need a large GPU to load the full-precision model into memory for LoRA training. This is where QLoRA, or Quantized LoRA, comes into the picture: QLoRA is a combination of LoRA and quantization.
    2. **QLoRA allows us to fine-tune the model with the base weights stored in 4-bit precision.**
- **Technique:**
    1. The original pre-trained weights of the model are quantized to 4-bit and kept fixed during fine-tuning.
    2. A small number of trainable parameters, in the form of low-rank adapters, are introduced during fine-tuning. These adapters, kept in 32-bit floating-point format, are trained to adapt the pre-trained model to the specific task it is being fine-tuned for.
    3. For computations (the forward and backward passes during training, and inference), the 4-bit quantized weights are dequantized back to 32-bit floating-point numbers.
    4. After fine-tuning, the model consists of the original weights in 4-bit form plus the additional low-rank adapters in their higher-precision format.
- **Fine-tuning via QLoRA, illustrated:**
    1. Forward pass
    ![](https://hackmd.io/_uploads/rJmzu03a2.png)
    2. De-quantization
    ![](https://hackmd.io/_uploads/SJUGt0hah.png)
    3. Backpropagation
        - Only the 32-bit LoRA weight tensors are updated and stored in memory.
- Performance:
![](https://hackmd.io/_uploads/BkhOFAh6h.png)
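To make the "Wx + BAx" forward pass from the LoRA section concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. It is an illustrative sketch, not a reference implementation: the zero initialization of B, the `alpha / r` scaling, and the wrapper class itself follow common convention but are assumptions here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear augmented with a trainable low-rank update: y = Wx + (alpha/r) * BAx."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # W (and bias) stay frozen
        n, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A is r x k, small random init
        self.B = nn.Parameter(torch.zeros(n, r))         # B is n x r, zero init so BA = 0 at the start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original path Wx plus the low-rank update BAx; only A and B receive gradients.
        return self.base(x) + self.scaling * ((x @ self.A.T) @ self.B.T)

# Only r * (n + k) parameters are trainable instead of n * k.
layer = LoRALinear(nn.Linear(1024, 1024), r=8)
y = layer(torch.randn(2, 1024))
```

For deployment the update can be merged once into the base weight (W' = W + (alpha/r) * B @ A), which is why LoRA adds no extra inference latency. QLoRA keeps the same structure but stores the frozen base weights in 4-bit and dequantizes them for the matrix multiplications.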
## RLHF - Reinforcement Learning from Human Feedback

### Why RLHF?

- Sometimes the model behaves badly:
![](https://hackmd.io/_uploads/rkIfNiUR3.png)
- We want our models to have three properties: **helpful, honest, and harmless**. We therefore fine-tune the model on human feedback so that the content it generates meets these criteria.
![](https://hackmd.io/_uploads/ryCyHsUCh.png)

### How to produce an RLHF dataset

- We need a **prompt dataset** and an **instruct LLM**. The instruct LLM performs the completion task:
![](https://hackmd.io/_uploads/rkOoIjI0h.png)
- From the outcomes of the completion task, we define our criterion (a ranking) based on our preferences:
![](https://hackmd.io/_uploads/rkhZwo8A2.png)
- Simple rules for human ranking:
    - Rank the responses according to which one **provides the best answer to the input prompt**.
    - What is the best answer? Make a decision based on **(a) the correctness of the answer** and **(b) the informativeness of the response**. For (a) you are allowed to search the web. Overall, use your best judgment to rank answers based on being the most useful response, which we define as one that is at least somewhat correct and minimally informative about what the prompt is asking for.
    - If two responses provide the same correctness and informativeness by your judgment and **there is no clear winner, you may rank them the same**, but please only do this sparingly.
    - If the answer for a given response is **nonsensical, irrelevant, highly ungrammatical/confusing**, or does not clearly respond to the given prompt, label it with "F" (for fail) rather than a rank.
    - **Long answers are not always the best.** Answers that provide succinct, coherent responses may be better than longer ones, as long as they are at least as correct and informative.
- Convert the ranking into a pairwise dataset:
![](https://hackmd.io/_uploads/Hk9nOsLRh.png)
Each entry becomes a pair of a labeled good example and a labeled bad example. We will use this pairwise dataset to train the reward model for PPO, as sketched below.
- Example RLHF dataset: [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf/viewer/default/train?row=2)
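The conversion from rankings to pairs, and the pairwise (Bradley-Terry-style) loss typically used to train a reward model on them, can be sketched as follows. This is a minimal sketch under stated assumptions: `reward_model` is a hypothetical callable that maps a (prompt, response) pair to a scalar score, ranks use 1 for the best response, and responses labeled "F" are assumed to be filtered out beforehand.

```python
import itertools
import torch.nn.functional as F

def ranking_to_pairs(prompt, completions, ranks):
    """Turn ranked completions for one prompt into (prompt, chosen, rejected) pairs."""
    pairs = []
    for (c_i, r_i), (c_j, r_j) in itertools.combinations(zip(completions, ranks), 2):
        if r_i == r_j:
            continue  # ranked the same -> no clear winner, so no training pair
        chosen, rejected = (c_i, c_j) if r_i < r_j else (c_j, c_i)
        pairs.append((prompt, chosen, rejected))
    return pairs

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise loss: push the score of the chosen response above the rejected one."""
    r_chosen = reward_model(prompt, chosen)      # scalar tensor
    r_rejected = reward_model(prompt, rejected)  # scalar tensor
    return -F.logsigmoid(r_chosen - r_rejected)  # -log sigmoid(r_chosen - r_rejected)
```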
### PPO (Proximal Policy Optimization)

- Flow of PPO-based RLHF:
    1. **Pre-training a language model (LM):**
        - Start by pre-training an LM using standard techniques.
        - Different organizations such as OpenAI, Anthropic, and DeepMind have used models of various sizes for this.
        - Some tweaks can be made using additional data, but this is optional; there is no clear consensus on the best starting model for RLHF.
    2. **Training a reward model (RM):**
        - This step distinguishes RLHF from traditional training methods.
        - The RM assesses the quality of text generated by the LM, giving it a "reward" based on human preferences.
        - The training data for the RM consists of prompt-response pairs; for example, Anthropic uses data collected from Amazon Mechanical Turk chats.
        ![](https://hackmd.io/_uploads/HkPkpjLC2.png)
        ![](https://hackmd.io/_uploads/ryWWpjICh.png)
        - Human evaluators rank multiple responses by quality. Ranking (rather than direct scoring) is chosen because it is more consistent and less noisy.
        - Organizations differ in the relative sizes of the LM and the RM; intuitively, the RM needs a comparable ability to understand the text it scores.
    3. **Fine-tuning with reinforcement learning (RL):**
        ![](https://hackmd.io/_uploads/SJWOniIR3.png)
        - Historically, RL was thought to be difficult to apply to training large LMs due to technical challenges.
        - The solution adopted by many is to use Proximal Policy Optimization (PPO) to tweak the LM parameters.
        - The LM is treated as a "policy" in RL terms: it takes a prompt (observation) and returns a text (action).
        - The reward function is based on feedback from the RM. In addition, a penalty is applied if the RL-tuned model's output deviates too far from the original LM, to keep generations consistent.
- PPO algorithm:
![](https://hackmd.io/_uploads/Sk0kyn8Rn.png)

### DPO (Direct Preference Optimization)

- Why not PPO?
![](https://hackmd.io/_uploads/S13AG2IA3.png)
- Direct preference optimization **skips training a separate reward model.**
- It offers a simple loss function and is less computationally expensive than the PPO-based RLHF method.
- How does DPO work?
    - In the DPO paper, the authors use the Bradley-Terry preference model in the loss function.
    - They demonstrate that the reward-modeling step can be skipped because the language model itself can implicitly **act as the reward model**.
    - Once that step is removed, the problem simplifies significantly to an optimization problem with a cross-entropy objective:
    ![](https://hackmd.io/_uploads/BkctV2ICh.png)
- Results:
![](https://hackmd.io/_uploads/B1y1Bh80n.png)
    - **On the left,** the **DPO frontier is more efficient than PPO's**: all DPO points (yellow) lie above the PPO points (orange). This plot shows the achieved reward against KL divergence in a sentiment generation task.
    - **On the right,** the win rate on the summarization task shows that **DPO surpasses PPO across all sampling temperatures**, including higher temperatures where generation is more varied.

#### Sources

- [LoRA Explained (article)](https://www.ml6.eu/blogpost/low-rank-adaptation-a-technical-deep-dive)
- [PEFT Explained (video)](https://www.youtube.com/watch?v=KoOlcX3XLd4)
- [LoRA Explained (video)](https://www.youtube.com/watch?v=dA-NhCtrrVE)
- [QLoRA Tutorial (video)](https://www.youtube.com/watch?v=TPcXVJ1VSRI)
- [Understanding LoRA and QLoRA](https://medium.com/@gitlostmurali/understanding-lora-and-qlora-the-powerhouses-of-efficient-finetuning-in-large-language-models-7ac1adf6c0cf)
- [Instruction-Tune Llama 2](https://www.philschmid.de/instruction-tune-llama-2)
- [LoRA from Scratch](https://medium.com/@alexmriggio/lora-low-rank-adaptation-from-scratch-code-and-theory-f31509106650)
- [PPO RLHF Tutorial with Workshop (video)](https://www.youtube.com/watch?v=-0pvrCLd2Ak&t=3128s)
- [PPO RLHF Explained in Detail (in Chinese)](https://mp.weixin.qq.com/s/TLQ3TdrB5gLb697AFmjEYQ)
- [PPO Theory Explained by Hung-yi Lee (in Chinese)](https://www.youtube.com/watch?v=OAKAZhFmYoI)
- [DPO Explanation](https://pakhapoomsarapat.medium.com/forget-rlhf-because-dpo-is-what-you-actually-need-f10ce82c9b95)
- [DPO Training Repo](https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py)
- [DPO Simple Explanation (video)](https://www.youtube.com/watch?v=pzh2oc6shic&t=1s)

#### References

- https://en.wikipedia.org/wiki/Catastrophic_interference