# Generative AI with Large Language Models - Lecture note

Link to course: https://www.coursera.org/learn/generative-ai-with-llms

## Introduction

* Overview: Trained on massive datasets, LLMs can mimic human abilities and solve complex tasks.
* Functionality: LLMs use numerous parameters, with a larger number indicating more sophisticated capabilities. They operate using a "context window" to process prompts.
* Interaction: Users interact with LLMs through natural language prompts, generating responses known as completions via inference.
* Applications: LLMs can be customized for specific tasks without retraining from scratch.
* Project lifecycle: building, training, and deploying models.

### LLM use cases and tasks

* Generate a summary based on the file you uploaded.
![Screenshot 2024-05-14 193105](https://hackmd.io/_uploads/rkf5Y0emR.png)
* Translation tasks (including human languages and programming languages).
![Screenshot 2024-05-14 193222](https://hackmd.io/_uploads/ByPyqCxQ0.png)
* Information retrieval tasks
![Screenshot 2024-05-14 193343](https://hackmd.io/_uploads/BJJ49ClQA.png)
* Active development in LLMs includes connecting them to external data sources or APIs to enhance their real-world interactions and provide up-to-date information.
![Screenshot 2024-05-14 194405](https://hackmd.io/_uploads/Sybih0g70.png)

## Text generation before transformers

### RNN-like models

* Predict the next word based on the previous few words.
* Problem: Language is complex, so to successfully predict the next word, the model needs to see much more context than just the previous few words.
* RNN-like models (including LSTMs and GRUs) can only predict based on a limited number of preceding words. If a sentence is too long, the model may forget earlier words, resulting in less accurate predictions.
* => The transformer architecture was born (using the attention mechanism).

### Transformers

* Self-attention (the word 'book' is strongly connected with 'student' and 'teacher')
![Screenshot 2024-05-14 195357](https://hackmd.io/_uploads/rJ3-J1b7A.png)

#### Architecture

* Includes 2 main parts: encoder and decoder
![Screenshot 2024-05-14 195647](https://hackmd.io/_uploads/BJ_9JyWX0.png)
* First, tokenize the input sentence into numeric token IDs (one input sentence produces many token IDs).
* Then pass each token ID of the input through an embedding layer to transform it into an embedding vector (each token ID has its own vector in the embedding space).
![Screenshot 2024-05-14 200454](https://hackmd.io/_uploads/ry6_ZJWmA.png)
* Embedding vector of each word/token ID.
![Screenshot 2024-05-14 200644](https://hackmd.io/_uploads/BJCyMJ-QA.png)
* Add positional encoding to preserve the order of the words.
![Screenshot 2024-05-14 200630](https://hackmd.io/_uploads/S1CC-JZmR.png)
* Then pass the resulting vectors to the multi-headed self-attention layer to capture the relationships between all tokens in the sequence.
    * The self-attention weights are learned during training and reflect the importance of each word to the other words.
* The output then goes through a feed-forward network, producing a vector of logits proportional to the probability score of each and every token in the tokenizer dictionary.
* The encoder output is fed into the decoder's multi-head attention layers, where it is combined with the decoder input before passing through the decoder's feed-forward network.
* Pass the decoder output to a softmax layer to normalize the logits into probability scores (the predicted token is the one with the highest score).
* In sequence-to-sequence tasks, the output token is fed back into the decoder to predict the next token.
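To make the attention step concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The embeddings, dimensions, and weight matrices are random stand-ins for illustration, not the course's code.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence of embeddings X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # project embeddings into queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # similarity of every token with every other token
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability before softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights per token
    return weights @ V, weights                      # each output is a weighted mix of all value vectors

# Toy example: 5 tokens with 8-dimensional embeddings (stand-ins for embeddings + positional encoding)
rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
print(attn.round(2))   # row i shows how strongly token i attends to every token in the sequence
```

In a real transformer this is repeated across multiple heads, and the attention weights are what let a word like "book" attend strongly to "student" and "teacher".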
## Prompt engineering

### In-context learning (ICL)

#### Zero-shot learning
* Give only the instruction to the LLM
![Screenshot 2024-05-16 152600](https://hackmd.io/_uploads/rJ8mXHmmA.png)

#### One-shot learning
* Give one example to the LLM
![Screenshot 2024-05-16 152800](https://hackmd.io/_uploads/BJDq7HXQA.png)

#### Few-shot learning
* Give multiple examples to the LLM
![Screenshot 2024-05-16 152842](https://hackmd.io/_uploads/SJb6XBQmR.png)

## Generative configuration
![Screenshot 2024-05-16 154632](https://hackmd.io/_uploads/BJrZ_SmXC.png)
* Max new tokens: the maximum length of the output.
* Sample top K: sample only from the k tokens with the highest probability.
* Sample top P: sample only from the tokens whose cumulative probability (from highest to lowest) does not exceed p.
* Temperature: controls the randomness of the output (higher temperature means more randomness).
![Screenshot 2024-05-16 155223](https://hackmd.io/_uploads/ByyLtrQ7R.png)
* A higher temperature helps you generate text that sounds more creative.
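To see how these settings interact at decoding time, here is a minimal NumPy sketch of sampling one token. The logits and tiny vocabulary are invented; in practice, libraries such as Hugging Face `transformers` expose the same knobs as the `temperature`, `top_k`, and `top_p` arguments of `generate`.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick the next token id from raw logits using temperature, top-k, and top-p (nucleus) sampling."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()                              # softmax with temperature: higher T -> flatter distribution
    order = np.argsort(probs)[::-1]                   # token ids sorted from most to least likely
    if top_k is not None:
        order = order[:top_k]                         # keep only the k most likely tokens
    if top_p is not None:
        keep = np.cumsum(probs[order]) <= top_p       # keep tokens until cumulative probability exceeds p
        keep[0] = True                                # always keep at least the most likely token
        order = order[keep]
    kept = probs[order] / probs[order].sum()          # renormalize over the kept tokens
    return int(rng.choice(order, p=kept))

logits = np.array([4.0, 3.5, 2.0, 0.5, -1.0])          # made-up scores over a tiny 5-token vocabulary
print(sample_next_token(logits, temperature=0.5, top_k=3))   # low temperature: almost always token 0 or 1
print(sample_next_token(logits, temperature=2.0, top_p=0.9)) # high temperature: noticeably more random
```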
## Pre-training LLMs

### Autoencoding models (Encoder-only models)
* Denoising objective
* Good use cases:
    * Sentiment analysis
    * Named entity recognition
    * Classification
* Example models: BERT, RoBERTa

### Autoregressive models (Decoder-only models)
* Predict the next token based on the previous sequence of tokens
* Good use cases:
    * Text generation
    * Zero-shot inference (for large models)
* Example models: GPT, BLOOM

### Sequence-to-sequence models (Encoder-decoder models)
* Span corruption: mask random sequences of input tokens, then replace the masked spans with a sentinel token $\text{<X>}$
* Good use cases:
    * Translation
    * Summarization
    * Question answering
* Example models: T5, BART

### Summary figure
![Screenshot 2024-05-17 144356](https://hackmd.io/_uploads/HJmpcK4QC.png)

## Computational challenges
* "CUDA out of memory"
    * e.g. training a 1B-parameter model at 32-bit full precision requires about 24 GB of GPU RAM => too much memory
* To reduce memory during training, there is a technique called "quantization".
* Main idea of quantization: reduce the memory required to store the model weights by reducing the weight precision from 32-bit floating point numbers (FP32) to lower-bit numbers
    * e.g. FP32 to 16-bit floating point (FP16/BFLOAT16), or 8-bit integer (INT8).

## Fine-tuning

### Instruction fine-tuning
* Use prompts to instruct the LLM, training it on pairs of example prompts and completions.
* You can fine-tune on a single task, e.g. summarization or translation.
![Screenshot 2024-05-19 131044](https://hackmd.io/_uploads/rJelxuGPQ0.png)
* The process of fine-tuning an LLM with prompt datasets
![Screenshot 2024-05-19 131316](https://hackmd.io/_uploads/B1ztOMPQR.png)

### Problem of fine-tuning on a single task
* Catastrophic forgetting: fine-tuning a language model for a specific task can lead to excellent results on that task, but it may degrade performance on others.
* How to avoid catastrophic forgetting if you need to?
    * Fine-tune the LLM on multiple tasks at the same time, but this is expensive and requires more data and memory.
    * Perform Parameter Efficient Fine-Tuning (PEFT) instead of full fine-tuning.
        * PEFT preserves the original LLM weights and only trains a small number of parameters.

## Evaluation
* Exact match: Accuracy = $\frac{\text{Correct predictions}}{\text{Total predictions}}$
    * Problem: a generative model can produce a sentence with the same meaning that does not exactly match the label.
* ROUGE: compares a predicted summary to one (or more) human reference summaries.
    * ROUGE-n
        * Recall = $\frac{\text{n-gram matches}}{\text{n-grams in reference}}$
        * Precision = $\frac{\text{n-gram matches}}{\text{n-grams in predicted output}}$
        * F1-score = $2 \times \frac{\text{precision } \times \text{ recall}}{\text{precision } + \text{ recall}}$
    * ROUGE-L: uses the Longest Common Subsequence (LCS)
        * Recall = $\frac{\text{LCS(Gen,Ref)}}{\text{unigrams in reference}}$
        * Precision = $\frac{\text{LCS(Gen,Ref)}}{\text{unigrams in predicted output}}$
        * F1-score = $2 \times \frac{\text{precision } \times \text{ recall}}{\text{precision } + \text{ recall}}$
    * Example of LCS(Gen, Ref)
        * Reference: It is cold outside.
        * Generated: It is very cold outside.
        * The longest common subsequences are "It is" and "cold outside", each of length 2 => LCS(Gen, Ref) = 2
* BLEU: compares a model-generated translation to human reference translations.
    * BLEU metric = Avg(precision across a range of n-gram sizes)
    * A BLEU score close to 1 means the generation closely matches the reference.
* Benchmarks: GLUE, SuperGLUE, HELM, etc.

## Parameter Efficient Fine-Tuning (PEFT)
* Full fine-tuning of LLMs is challenging
    * Memory must hold not only the trainable weights but also optimizer states, gradients, forward activations, and temporary memory => easily runs out of memory
    * Catastrophic forgetting when fine-tuning on a specific task
* PEFT
    * Most LLM layers are frozen and only a small number of existing parameters are trained
    * Or new trainable layers/parameters are added while the base layers stay frozen
    * Less prone to catastrophic forgetting
    * Flexible: easily swap tasks at inference time (because each task has its own small set of trained weights)

### Methods
* Selective: fine-tune only a subset of the original LLM parameters.
* Reparameterization: reduce the number of parameters to train by creating new low-rank transformations
    * e.g. LoRA
* Additive: add new trainable layers or parameters
    * Adapters: new layers added inside the encoder or decoder, typically after the attention or feed-forward layers
    * Soft prompts: keep the model architecture frozen and focus on manipulating the input

### Low-Rank Adaptation of LLMs (LoRA)
* General idea: keep the original weights frozen and train two small rank-decomposition matrices whose product is added to the original weights (a small numerical sketch appears at the end of this PEFT section)
![Screenshot 2024-05-23 212415](https://hackmd.io/_uploads/Hyw9-03XA.png)
* Example:
    * The Transformer weights have dimensions $d \times k = 512 \times 64$, which means $512 \times 64 = 32768$ trainable parameters.
    * With LoRA, rank r = 8:
        * Matrix A has dimensions $r \times k = 8 \times 64$, i.e. 512 parameters.
        * Matrix B has dimensions $d \times r = 512 \times 8$, i.e. 4096 parameters.
    * That reduces the number of trainable parameters by about 86%.
* In LLMs, we can train a different pair of LoRA matrices per task and swap them to switch tasks
![Screenshot 2024-05-23 212957](https://hackmd.io/_uploads/S1jkQAhQR.png)
* Drawback: a LoRA-tuned model may score slightly lower (e.g. on ROUGE) than a fully fine-tuned one, but the difference is small.

### Soft prompts
* **Prompt tuning is not prompt engineering!!**
* Add additional trainable tokens before the prompt
* Soft prompt: the set of additional trainable tokens, whose vectors have the same length as the token embedding vectors
* Like LoRA, to switch to another task you can swap the soft prompt at inference time
![Screenshot 2024-05-30 132423](https://hackmd.io/_uploads/HkC5jcHER.png)
* Prompt tuning can be as effective as full fine-tuning for large models (around 10B parameters and above)

### Summary
![Screenshot 2024-05-30 133018](https://hackmd.io/_uploads/rkxW-T9rNR.png)
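As referenced in the LoRA section above, here is a minimal NumPy sketch of the low-rank update, using the dimensions from the example (d = 512, k = 64, r = 8). The frozen base weight and the input are random stand-ins; LoRA's alpha/r scaling factor is omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 64, 8                    # dimensions from the example above

W = rng.normal(size=(d, k))             # frozen pretrained weight: 512 * 64 = 32,768 parameters
A = rng.normal(size=(r, k)) * 0.01      # trainable low-rank matrix A: 8 * 64 = 512 parameters
B = np.zeros((d, r))                    # trainable low-rank matrix B: 512 * 8 = 4,096 parameters
                                        # (B starts at zero so training starts from the base model)

def lora_forward(x):
    """Base projection plus the low-rank update B @ A; only A and B would be trained."""
    return x @ (W + B @ A).T

x = rng.normal(size=(1, k))             # one input activation of dimension k
print(lora_forward(x).shape)            # (1, 512)
print("trainable:", A.size + B.size, "vs full fine-tuning:", W.size)   # 4,608 vs 32,768 (~86% fewer)
```

Swapping tasks at inference time amounts to swapping in a different pair of A and B matrices while W stays untouched.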
## Reinforcement learning from human feedback (RLHF)
* Bad behaviors of LLMs:
    * Toxic language
    * Aggressive responses
    * Dangerous information
* => Is the model Helpful, Honest, and Harmless (HHH)?
* => Align the model with human feedback to increase its helpfulness, honesty, and harmlessness
* Use RL to fine-tune the model from human feedback

### How it works?
* The process is based on traditional RL
![Screenshot 2024-06-03 152824](https://hackmd.io/_uploads/SkV3AloV0.png)
* The objective can be to generate helpful text, or non-toxic content
* The reward $r_t$: based on how closely the completions align with human preferences
* Collecting human feedback directly is time-consuming and costly => in practice, we use an additional reward model to evaluate the degree of alignment with human preferences

### Reward models

#### Prepare dataset
* **This is the most important step in building the reward model**
* Collect human feedback: rank the completions, or simply flag a completion as failing the objective.
* Convert the rankings into pairwise training data
    * e.g. if completion $y_i$ is ranked 2nd and $y_j$ is ranked 3rd, the pair {$y_i, y_j$} gets the reward vector [1, 0] (the higher-ranked completion gets 1)

#### Train model
* Simple training flow (a sketch of the pairwise loss commonly used here appears at the end of this RLHF section)
![Screenshot 2024-06-03 160301](https://hackmd.io/_uploads/HJn6UWjVC.png)
* The reward model can be used as a binary classifier that provides logits across the positive and negative classes.
* The logit value of the positive class is used as the reward value in RLHF

### Fine-tuning the LLM with RL using the reward model
* Example of the RLHF fine-tuning loop
![Screenshot 2024-06-03 161608](https://hackmd.io/_uploads/ByVx5biV0.png)
* The reward increases over iterations until the instruct LLM becomes a human-aligned LLM
* A popular RL algorithm for this is Proximal Policy Optimization (PPO)

#### PPO
* Includes 2 phases:
    * Phase 1: Create completions
        * Calculate the reward values and the losses
    * Phase 2: Update the model
        * Keep the model update within a trust region
        * Find the policy whose expected reward is high

### Reward hacking
* The agent learns to cheat by favoring actions that maximize the reward, even if those actions don't align well with the original objective
* This reduces overall quality
    * e.g. the model produces word sequences that increase the reward value but are useless or nonsensical

#### Avoid reward hacking
* Use the initial instruct LLM as a reference model, and penalize the updated model when its outputs drift too far from the reference (KL-divergence shift penalty)
* Maintain a single reference model => freeze its weights
* The flow after adding the reference model
![Screenshot 2024-06-04 212540](https://hackmd.io/_uploads/SJ1gVj2VA.png)
* In this figure, we only update the weights of the PEFT adapter
* With PEFT, we can reuse the same base LLM for both the reference model and the PPO-updated model

### Scaling human feedback
* The labeled dataset required to train the reward model is costly because it needs highly trained human labelers => this is an active research area

#### Constitutional AI: scaling supervision
* Uses a set of rules & principles to govern the model's behavior
    * e.g. control the harmlessness of the completions when the LLM was only trained for a helpfulness objective
* Train the model in 2 distinct stages:
    * Supervised learning:
        * Prompt the model to generate harmful responses (called red teaming)
        * Then ask the model to critique its own responses according to the constitutional principles and revise them
        * Example problem:
![Screenshot 2024-06-05 210706](https://hackmd.io/_uploads/BkwzZlANC.png)
    * Reinforcement learning: similar to RLHF, but human feedback is replaced with model-generated feedback (RLAIF)
* Full process:
![Screenshot 2024-06-05 211325](https://hackmd.io/_uploads/Bk3tMgAN0.png)
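To connect back to the reward-model training step above: a common way to train on the pairwise data is to minimize the negative log-sigmoid of the reward difference between the preferred and the rejected completion. This is a sketch of that standard formulation, not necessarily the exact code used in the course labs.

```python
import numpy as np

def pairwise_reward_loss(r_preferred, r_rejected):
    """Loss for one {y_i, y_j} pair: -log sigmoid(r_i - r_j).
    Minimizing it pushes the reward of the human-preferred completion above the rejected one."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_preferred - r_rejected))))

# Made-up scalar rewards produced by the reward model for two completions of the same prompt
print(pairwise_reward_loss(r_preferred=2.1, r_rejected=0.3))  # small loss: model already agrees with humans
print(pairwise_reward_loss(r_preferred=0.3, r_rejected=2.1))  # large loss: ranking disagrees with humans
```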
## Optimizations for deployment
* Common approach: reduce the size of the LLM
* The question is how to reduce size while maintaining performance

### Distillation
* Use a larger model (teacher model) to train a smaller model (student model)
* Use the teacher model to generate completions for the training data (labels)
* At the same time, the student model generates its own completions for the training data
* The loss between the teacher's labels and the student's predictions is the distillation loss (smaller is better)
* Full process
![Screenshot 2024-06-08 205230](https://hackmd.io/_uploads/r12QGkGHR.png)
* A temperature T > 1 makes the softmax distribution broader and less strongly peaked
    * This provides a set of tokens that are similar to the ground-truth tokens

### Quantization
* Transform the weights into a lower-precision representation after training (post-training quantization, PTQ)
* Tradeoff: quantization sometimes causes a small reduction in model evaluation metrics

### Pruning
* Remove redundant parameters that contribute little to performance (values close or equal to 0)
* Methods
    * Full model re-training
    * LoRA
    * Post-training

## Cheat Sheet
![Screenshot 2024-06-08 205909](https://hackmd.io/_uploads/rJ8nQ1GHC.png)

## Applications

### Some difficulties
* Knowledge is out of date (retraining to incorporate the latest information is expensive)
* Math: some simple calculations may be wrong
* Hallucination

### LLM-powered applications
![Screenshot 2024-06-08 210407](https://hackmd.io/_uploads/HkEkSJzBR.png)

### RAG
* Framework for building LLM-powered systems that make use of external data sources and applications.
* A great way to overcome the knowledge-cutoff issue

#### How it works?
* Retriever, which includes:
    * Query encoder: encodes your input prompt/query
    * External information sources: the external knowledge; it can be documents, wikis, web pages, databases, a vector store, etc.
* The retriever is trained to find the documents in the external data that are most relevant/useful to the input query
* The retrieved documents are then combined with the query and passed to the LLM
* Figure
![Screenshot 2024-06-08 211134](https://hackmd.io/_uploads/BklsL1fS0.png)

#### Data preparation

##### Vector store
* 2 considerations:
    * Data must fit inside the context window (prompt context limit), so if it is too long we can split it into chunks (e.g. with LangChain)
    * Data must be in a format that allows its relevance to be assessed at inference time: **embedding vectors**
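A minimal sketch of the retrieval step with a toy in-memory vector store. The hand-made 3-dimensional embeddings, example chunks, and query vector are invented for illustration; a real system would use a trained text encoder and a vector database.

```python
import numpy as np

# Toy in-memory "vector store": each text chunk is stored with an embedding vector.
store = [
    ("The plaintiff filed the case in October 2023.",                  np.array([0.9, 0.1, 0.0])),
    ("LoRA trains two low-rank matrices instead of the full weights.", np.array([0.1, 0.9, 0.1])),
    ("RAG retrieves relevant documents and adds them to the prompt.",  np.array([0.1, 0.2, 0.9])),
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_embedding, k=2):
    """Return the k chunks whose embeddings are most similar to the query embedding."""
    ranked = sorted(store, key=lambda item: -cosine(query_embedding, item[1]))
    return [text for text, _ in ranked[:k]]

# Pretend the query encoder mapped the user's question to this vector.
query_embedding = np.array([0.0, 0.3, 0.95])
context = "\n".join(retrieve(query_embedding))
prompt = f"Answer the question using the context below.\n\nContext:\n{context}\n\nQuestion: How does RAG work?"
print(prompt)   # this augmented prompt is what gets sent to the LLM
```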
#### Interacting with external applications
* Trigger API calls (e.g. a Python interpreter)
* Perform calculations
* Requirements
    * Plan actions (step 1 do something, step 2 do something, etc.)
    * Format outputs (can be a specific structure or an SQL query)
    * Validate actions: collect the information needed to validate (e.g. verify email, phone number)

### Improving reasoning and planning with chain-of-thought (CoT)
* LLMs can struggle with complex reasoning problems
* CoT: include step-by-step reasoning with explanations in the prompt's examples (can use one-shot or few-shot examples)

### Program-aided language models (PAL)
* Pair an LLM with a code interpreter
* Basic idea: the LLM generates completions where reasoning steps are accompanied by computer code.
* Process (similar to other LLM-powered applications):
![Screenshot 2024-06-08 213904](https://hackmd.io/_uploads/SyympyfHR.png)
* Example:
![Screenshot 2024-06-08 213703](https://hackmd.io/_uploads/SJ1snyGr0.png)
* In the "answer", each line is Python code, written in CoT style

### ReAct
* Combines CoT reasoning and actions in LLMs (e.g. LLM + web search API)
* Basic structure
    * Question: the complex problem that requires reasoning and multiple steps
    * Thought: a reasoning step that specifies how the model will approach the issue and decide on a course of action
    * Action: an external task the model can carry out
    * Observation: the result of the action
* The model repeats the last three steps as many times as needed to obtain the final answer (a sketch of such a prompt appears at the end of these notes)
* Building up the ReAct prompt
![Screenshot 2024-06-08 215220](https://hackmd.io/_uploads/HJRQelMB0.png)

### Architectures
![Screenshot 2024-06-08 215358](https://hackmd.io/_uploads/By19lgzSC.png)
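To make the ReAct Thought/Action/Observation structure concrete, here is a sketch of what such a prompt could look like, written as a Python string. The `search[...]`/`finish[...]` tool syntax and the worked example follow the well-known question used in the ReAct paper; they are illustrative, not the course's exact prompt.

```python
# A ReAct-style prompt skeleton; the tool syntax and example are illustrative only.
react_prompt = """Answer the question by interleaving Thought, Action, and Observation steps.
Available action: search[query] -- looks up information on the web.

Question: Which magazine was started first, Arthur's Magazine or First for Women?
Thought: I need to find out when each magazine was started.
Action: search[Arthur's Magazine]
Observation: Arthur's Magazine was an American literary periodical first published in 1844.
Thought: Next I need the start date of First for Women.
Action: search[First for Women]
Observation: First for Women is a women's magazine launched in 1989.
Thought: 1844 is earlier than 1989, so Arthur's Magazine was started first.
Action: finish[Arthur's Magazine]
"""
print(react_prompt)
```

At runtime, the application (not the LLM) executes each `Action`, appends the real `Observation` to the prompt, and calls the model again until it emits a `finish` action.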