# Generative AI with Large Language Models - Lecture note
- Link to course: https://www.coursera.org/learn/generative-ai-with-llms
## Introduction
* Overview: Trained on massive datasets, LLMs can mimic human abilities and solve complex tasks.
* Functionality: LLMs use numerous parameters, with a larger number indicating more sophisticated capabilities. They operate using a "context window" to process prompts.
* Interaction: Users interact with LLMs through natural language prompts, generating responses known as completions via inference.
* Applications: LLMs can be customized for specific tasks without retraining from scratch.
* Project lifecycle: building, training, and deploying models.
### LLM use cases and tasks
* Generate a summary based on the file you uploaded.

* Translation tasks (including both human languages and programming languages).

* Information retrieval task

* Active development in LLMs includes connecting them to external data sources or APIs to enhance their real-world interactions and provide up-to-date information.

## Text generation before transformers
### RNN-like models
* Predict the next word based on the previous few words.
* Problem: language is complex, so to predict the next word successfully the model needs to see much more context than just the last few words.
* RNN-like models (including LSTMs and GRUs) can only predict based on a limited number of preceding words. If a sentence is too long, the model may forget earlier words, resulting in less accurate predictions.
=> The Transformer architecture was born (using the attention mechanism).
### Transformers
* Self-attention: each word attends to every other word (e.g. the word 'book' is strongly connected with 'student' and 'teacher')

#### Architecture
* Includes 2 main parts: an encoder and a decoder

* First, tokenize the input sentence into numeric token IDs (one input sentence maps to many token IDs).
* Then pass each token ID through the embedding layer to obtain an embedding vector (each token ID has its own vector in the embedding space).

* Embedding vector of each word/token ID.

* Add positional encoding to preserve the order of the words.

* Then, pass the resulting vectors to the multi-headed self-attention layer, which learns the relationships between all tokens in the sequence (a minimal code sketch follows at the end of this subsection).
* The self-attention weights are learned during training and reflect the importance of each word relative to the others.
* The output then goes through a feed-forward network, producing a vector of logits proportional to the probability score of every token in the tokenizer's vocabulary.
* In the decoder, the encoder output is fed into a multi-head (cross-)attention layer together with the decoder input before passing through the decoder's feed-forward network.
* Finally, pass the decoder output to a softmax layer, which normalizes the logits into probability scores (the predicted token is the one with the highest score).
* In sequence-to-sequence tasks, the output token is fed back into the decoder to predict the next token.
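To make the attention step concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention; the dimensions and random weights are illustrative assumptions, not the course's code:

```python
# Minimal sketch of single-head scaled dot-product self-attention (NumPy).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings (after adding positional encoding)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project tokens into query/key/value spaces
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise attention scores between tokens
    weights = softmax(scores, axis=-1)        # attention weights: importance of each token
    return weights @ V                        # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # -> (5, 16)
```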
## Prompt engineering
### In-context learning (ICL)
#### Zero shot learning
* Give the instruction to the LLM

#### One shot learning
* Give one example to the LLM

#### Few shot learning
* Give many examples to the LLM
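As an illustration, a few-shot prompt simply stacks solved examples before the new input. The sentiment-classification wording below is a hedged example of the format, not a prompt taken from the course:

```python
# Hedged example of zero-/one-/few-shot prompt construction for a sentiment task.
zero_shot = "Classify this review: I loved this movie!\nSentiment:"

one_shot = (
    "Classify this review: I loved this movie!\nSentiment: Positive\n\n"
    "Classify this review: The plot made no sense.\nSentiment:"
)

few_shot = (
    "Classify this review: I loved this movie!\nSentiment: Positive\n\n"
    "Classify this review: The plot made no sense.\nSentiment: Negative\n\n"
    "Classify this review: The acting was superb.\nSentiment:"
)
print(few_shot)
```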

## Generative configuration

* Max new tokens: the maximum number of new tokens the model may generate.
* Sample top K: sample only from the k tokens with the highest probability.
* Sample top P: sample only from the most likely tokens whose cumulative probability (summed from high to low) does not exceed p.
* Temperature: controls the randomness of the output (higher temperature means more randomness).

* A higher temperature helps you generate text that sounds more creative.
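A minimal sketch of how these settings shape the sampling step; the logits and parameter values are illustrative assumptions:

```python
# Hedged sketch: temperature, top-k and top-p applied to a vector of logits.
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    rng = rng or np.random.default_rng()
    probs = np.exp(logits / temperature)
    probs = probs / probs.sum()                           # temperature-scaled softmax

    order = np.argsort(probs)[::-1]                       # token ids sorted high -> low probability
    keep = np.ones_like(probs, dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False                       # keep only the k most likely tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        n_keep = max(1, int(np.sum(cumulative <= top_p)))  # tokens whose cumulative prob <= p
        keep[order[n_keep:]] = False

    probs = np.where(keep, probs, 0.0)
    probs = probs / probs.sum()                           # renormalize over the kept tokens
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
print(sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9))
```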
## Pre-training LLMs
### Autoencoding models (Encoder-only models)
* Pre-training objective: masked language modeling — reconstruct randomly masked tokens (a denoising objective)
* Good use cases:
* Sentiment analysis
* Named entity recognition
* Classification
* Example models: BERT, RoBERTa
### Autoregressive models (Decoder-only models)
* Predict the next token based on the previous sequence of tokens
* Good use cases:
* Text generation
* Zero-shot inference (for large models)
* Example models: GPT, BLOOM
### Sequence-to-sequence models (Encoder-decoder models)
* Span corruption: mask random spans of input tokens; each masked span is replaced with a unique sentinel token such as $\text{<X>}$, and the model is trained to reconstruct the masked text (e.g. input "Thank you $\text{<X>}$ me to your party $\text{<Y>}$ week." with target "$\text{<X>}$ for inviting $\text{<Y>}$ last")
* Good use cases:
* Translation
* Summarization
* Question Answering
* Example models: T5, BART
### Summary figure

## Computational challenges
* "CUDA out of memory"
* e.g. to train a 1B-parameter model at 32-bit full precision, we need about 24 GB of GPU RAM (weights plus optimizer states, gradients, and activations).
=> Too much memory
* To reduce memory during training, there is a technique called "quantization".
* Main idea of quantization: reduce the memory required to store the weights of the model by lowering their precision from 32-bit floating point numbers (FP32) to lower-bit representations
* e.g. FP32 to 16-bit floating point (FP16/BFLOAT16), or 8-bit integer (INT8).
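A back-of-the-envelope sketch of the storage savings; the bytes-per-parameter figures are the standard sizes for these data types, and the 1B-parameter count is the example above:

```python
# Rough memory needed just to STORE the weights of a 1B-parameter model
# at different precisions (ignores optimizer states, gradients, activations).
params = 1_000_000_000
bytes_per_param = {"FP32": 4, "FP16/BFLOAT16": 2, "INT8": 1}

for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype:>14}: {params * nbytes / 1e9:.0f} GB")
# FP32: 4 GB, FP16/BFLOAT16: 2 GB, INT8: 1 GB
```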
## Fine-tuning
### Instruction fine-tuning
* Train the LLM on prompt-completion pairs, where each prompt contains an instruction (e.g. "Summarize the following text: ...") and the completion is the desired response.
* You can fine-tune on a single task, e.g. summarization or translation.

* The process of fine-tuning an LLM with instruction prompts

### Problem of fine-tuning single task
* Catastrophic forgetting: fine-tuning a language model for a specific task can lead to excellent results on that task, but it may degrade performance on others.
* How to avoid catastrophic forgetting if you need?
* Fine-tune the LLM on multiple tasks at the same time, but this is expensive because it requires more data and compute.
* Perform Parameter Efficient Fine-tuning (PEFT) instead of full fine-tuning.
* PEFT preserves the original LLM weights and trains only a small number of task-specific parameters.
## Evaluation
* Exact match: Accuracy = $\frac{\text{Correct predictions}}{\text{Total predictions}}$
* Problem: a generative model can produce a sentence with the same meaning that does not exactly match the label.
* ROUGE: compares a predicted summary to one (or more) human reference summaries.
* ROUGE-n
* Recall = $\frac{\text{n-gram matches}}{\text{n-gram in reference}}$
* Precision = $\frac{\text{n-gram matches}}{\text{n-gram in predicted output}}$
* F1-score = $2 \times \frac{\text{precision } \times \text{ recall}}{\text{precision } + \text{ recall}}$
* ROUGE-L: using Longest Common Subsequence (LCS)
* Recall = $\frac{\text{LCS(Gen,Ref)}}{\text{unigrams in reference}}$
* Precision = $\frac{\text{LCS(Gen,Ref)}}{\text{unigrams in predicted output}}$
* F1-score = $2 \times \frac{\text{precision } \times \text{ recall}}{\text{precision } + \text{ recall}}$
* Example of LCS(Gen,Ref)
* Reference: It is cold outside.
* Generated: It is very cold outside.
* LCS: "It is", "cold outside"
=> LCS(Gen, Ref) = 2
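A minimal sketch of ROUGE-1 on the example above, using plain Python (in practice a library such as `rouge_score` would be used):

```python
# Hedged sketch: ROUGE-1 recall / precision / F1 via unigram overlap counts.
from collections import Counter

def rouge_1(generated: str, reference: str):
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum((gen & ref).values())        # clipped unigram matches
    recall = matches / sum(ref.values())       # matches / unigrams in reference
    precision = matches / sum(gen.values())    # matches / unigrams in predicted output
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

print(rouge_1("It is very cold outside", "It is cold outside"))
# -> (1.0, 0.8, 0.888...)
```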
* BLEU: compares a model-generated translation to human reference translations.
* BLEU metric = Avg(precision across range of n-gram sizes)
* BLEU close to 1 means the generation has close meaning to the reference.
* Benchmarks: GLUE, SuperGLUE, HELM, etc.
## Perform Parameter Efficient Fine-tuning (PEFT)
* Full fine-tuning of LLMs is challenging:
* Lack of memory: the trainable weights, optimizer states, gradients, forward activations, and temporary memory all have to fit on the GPU
* Catastrophic forgetting when fine-tuning on a specific task
* PEFT
* Most of the LLM's layers are frozen and only a small subset of the original parameters is trained
* Or the base LLM is fully frozen and small new trainable layers/parameters are added
* Less prone to catastrophic forgetting
* Flexible: easily swap the task for inference (because they use new trainable layers for each task)
### Methods
* Selective: fine-tune only a subset of the original LLM parameters.
* Reparameterization: reparameterize model weights using new low-rank transformations, reducing the number of parameters to train
* e.g. LoRA
* Additive: add new trainable layers or parameters
* Adapters: new layers added inside the encoder or decoder, typically after the attention or feed-forward layers
* Soft prompts: keep the model architecture frozen, focus on manipulating the input
### Low-Rank Adaptation of LLMs (LoRA)
* General idea: freeze the original weights and train two small low-rank matrices whose product has the same dimensions as the original weight matrix; at inference, their product is added to the frozen weights (a minimal code sketch follows at the end of this subsection).

* Example:
* The Transformer weights have dimensions $d \times k = 512 \times 64$, which means $512 \times 64 = 32768$ trainable parameters.
* With LoRA, rank r = 8:
* The matrix A has dimensions $r \times k = 8 \times 64$, i.e. $512$ parameters.
* The matrix B has dimensions $d \times r = 512 \times 8$, i.e. $4096$ parameters.
* That reduces the number of trainable parameters by roughly 86%.
* In LLMs, we can train a separate pair of LoRA matrices for each task and swap them in to switch tasks.

* Drawback: a LoRA-tuned model may score slightly lower (e.g. on ROUGE) than a fully fine-tuned one, but the difference is small.
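A minimal NumPy sketch of the idea, reusing the dimensions from the example above; the alpha scaling factor and the zero/random initialization are common LoRA conventions, assumed here rather than taken from the notes:

```python
# Hedged LoRA sketch: the frozen weight W is augmented with a trainable
# low-rank update B @ A, scaled by alpha / r.
import numpy as np

d, k, r, alpha = 512, 64, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))            # frozen pre-trained weights (never updated)
A = rng.normal(size=(r, k)) * 0.01     # trainable, r x k = 512 params
B = np.zeros((d, r))                   # trainable, d x r = 4096 params (starts at 0)

def lora_forward(x):
    """x: (batch, d) input; returns (batch, k) output of the adapted layer."""
    return x @ (W + (alpha / r) * (B @ A))

x = rng.normal(size=(1, d))
print(lora_forward(x).shape)                     # -> (1, 64)
print("full fine-tune params:", W.size)          # 32768
print("LoRA params:", A.size + B.size)           # 4608 (~14% of full, ~86% fewer)
```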
### Soft prompts
* **Prompt tuning is not prompt engineering!!**
* Adding additional trainable tokens before the prompt
* Soft prompt: the set of additional trainable token vectors, each with the same length as the token embedding vectors
* Like LoRA, you can train a separate soft prompt per task and switch soft prompts at inference time

* Prompt tuning becomes as effective as full fine-tuning for large models (above roughly 10B parameters)
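A minimal NumPy sketch of prompt tuning; the model dimension and the number of virtual tokens are illustrative assumptions:

```python
# Hedged soft-prompt sketch: trainable virtual-token embeddings are prepended
# to the (frozen) token embeddings of the real prompt.
import numpy as np

d_model, n_virtual_tokens, seq_len = 768, 20, 10
rng = np.random.default_rng(0)

soft_prompt = rng.normal(size=(n_virtual_tokens, d_model)) * 0.02  # trainable
token_embeddings = rng.normal(size=(seq_len, d_model))             # frozen model's embeddings of the real prompt

# The LLM sees the soft prompt and the real prompt as one longer sequence;
# during training, gradients update only `soft_prompt`.
model_input = np.concatenate([soft_prompt, token_embeddings], axis=0)
print(model_input.shape)   # -> (30, 768)
```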
### Summary

## Reinforcement learning from human feedback (RLHF)
* Bad behaviors of LLMs:
* Toxic language
* Aggressive responses
* Dangerous information
=> Helpful? Honest? Harmless? (HHH)
=> Align model with human feedback to increase the helpfulness, honesty and harmlessness
* Using RL to fine-tune model from human feedback
### How it works?
* The process is based on the traditional RL setup

* The objective can be to generate helpful text or non-toxic content
* The reward $r_t$: based on how closely the completions align with human preferences
* Collecting human feedback directly is time-consuming and costly
=> In practice, we use an additional model (a reward model) to evaluate the degree of alignment with human preferences
### Reward models
#### Prepare the dataset
* **This is the most important step to build reward model**
* Collect data from human feedback: label the completions with a ranking, or simply flag completions that fail the objective
* Convert ranking into pairwise training data
* e.g. if completion $y_i$ is ranked 2nd and $y_j$ is ranked 3rd, the pair {$y_i, y_j$} gets the reward labels [1, 0] (the higher-ranked completion gets 1)
#### Train model
* Simple training flow

* The trained model can be used as a binary classifier that provides logits across the positive and negative classes.
* The logit value of the positive class is used as the reward value in RLHF
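A minimal sketch of the pairwise objective used to train the reward model on ranked pairs; the numeric reward scores are illustrative assumptions:

```python
# Hedged sketch of a pairwise reward-model objective: for a preferred completion
# and a rejected one, minimize -log(sigmoid(r_preferred - r_rejected)).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_reward_loss(r_preferred: float, r_rejected: float) -> float:
    return -np.log(sigmoid(r_preferred - r_rejected))

# Example: the preferred completion scores 1.2, the rejected one scores -0.3
print(pairwise_reward_loss(1.2, -0.3))   # small loss when the ordering is correct
print(pairwise_reward_loss(-0.3, 1.2))   # large loss when the ordering is wrong
```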
### Fine-tuning LLM with RL using reward model
* Example of fine-tuning LLM with RLHF loop

* The loop is repeated until the reward reaches a defined threshold, at which point the instruct LLM has become a human-aligned LLM
* Popular RL algorithm is Proximal Policy Optimization (PPO)
#### PPO
* Includes 2 phases:
* Phase 1: Create completions
* The LLM completes a set of prompts; calculate the reward values and the loss
* Phase 2: Update model
* Keep the model updates within a small trust region (the "proximal" part of PPO)
* Find the policy whose expected reward is high
### Reward hacking
* The agent learns to cheat by favoring actions that maximize the reward (even if those actions don't align well with the original objective)
* Reduce overall quality
* e.g. the model learns to produce word sequences that inflate the reward value even though the generated text is useless
#### Avoid reward hacking
* Use the initial instruct LLM as a frozen reference model to prevent this
* Maintain a single reference model with frozen weights; penalize the updated model when its completions diverge too far from the reference (e.g. a KL-divergence penalty added to the reward, sketched at the end of this subsection)
* The flow after using reference model

* In this figure, we only update the weights of PEFT adapter
* We can reuse the same frozen base LLM for both the reference model and the PPO-updated model (only the PEFT adapter weights differ)
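A minimal sketch of the KL-penalty idea; `beta`, the log-probabilities, and the reward score are illustrative assumptions:

```python
# Hedged sketch: the reward passed to PPO is the reward-model score minus a
# penalty proportional to how far the updated policy's token probabilities
# drift from the frozen reference model.
import numpy as np

def kl_penalized_reward(reward_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Per-sequence reward with a KL-divergence shift penalty (log-ratio estimate)."""
    kl_per_token = policy_logprobs - reference_logprobs   # log-ratio per generated token
    return reward_score - beta * kl_per_token.sum()

policy_lp = np.array([-1.2, -0.8, -2.0])      # log-probs of generated tokens under the updated model
reference_lp = np.array([-1.3, -1.1, -2.1])   # log-probs of the same tokens under the frozen reference
print(kl_penalized_reward(reward_score=2.5,
                          policy_logprobs=policy_lp,
                          reference_logprobs=reference_lp,
                          beta=0.1))
```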
### Scaling human feedback
* The labeled dataset required to train the reward model is costly because it needs a large number of skilled human labelers
=> This is an active research area
#### Constitutional AI: scale supervision
* Using a set of rules & principles
=> Govern model's behavior
* e.g. keep completions harmless even when the LLM was only trained for a helpfulness objective
* Train the model in 2 distinct stages:
* Supervised learning:
* Prompt the model to generate harmful responses (called red teaming)
* Then, ask the model to critique its own responses according to the constitutional principles and revise them
* Example problem: 
* Reinforcement learning: similar to RLHF (replace human feedback with model's feedback)
* Full process:

## Optimizations for deployment
* Common approach: reduce the size of LLM
* So how to maintain the performance?
### Distillation
* Use a larger model (teacher model) to train the smaller model (student model)
* Use the teacher model to generate completions for the training data (soft labels)
* In parallel, the student model generates its own completions for the same training data
* The loss between the teacher's soft labels and the student's soft predictions is the distillation loss (smaller is better); it is combined with the standard student loss against the ground-truth labels
* Full process

* A temperature T > 1 makes the teacher's softmax distribution broader and less strongly peaked
* So the student sees a richer set of tokens that are similar to the ground-truth tokens
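A minimal sketch of the temperature-softened distillation loss; the logits and the temperature value are illustrative assumptions:

```python
# Hedged sketch of knowledge distillation: the student is trained to match the
# teacher's temperature-softened token distribution (soft labels).
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

teacher = np.array([4.0, 1.0, 0.5, 0.1])   # teacher logits over 4 tokens
student = np.array([3.0, 1.5, 0.2, 0.3])   # student logits over the same tokens
print(distillation_loss(teacher, student, T=2.0))
```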
### Quantization
* Transform the weights into a lower-precision representation after training (post-training quantization, PTQ)
* Tradeoffs: sometimes quantization results in a small percentage reduction in model evaluation metrics
### Pruning
* Remove redundant parameters that contribute little to the performance (values close or equal to 0)
* Methods
* Full model re-training
* LoRA
* Post-training
## Cheat Sheet

## Applications
### Some difficulty
* Knowledge is out of date (retraining the model to stay current is expensive)
* Math: some simple calculations may be wrong
* Hallucination
### LLM-powered applications

### RAG
* A framework for building LLM-powered systems that make use of external data sources and applications.
* A great way to overcome the knowledge cutoff issue
#### How it works?
* Retriever: includes:
* Query encoder: encodes the input prompt (query) in the same format as the external documents
* External information sources: external knowledge such as documents, wikis, web pages, databases, vector stores, etc.
* The retriever is trained to find the documents in the external data that are most relevant/useful to the input query
* The retrieved text is then combined with the original prompt and passed to the LLM
* Figure

#### Data preparation
##### Vector store
* 2 considerations:
* Data must fit inside the context window (prompt context limit), so long documents must be split into chunks (e.g. with LangChain)
* Data must be in format that allows its relevance to be assessed at inference time: **Embedding vectors**
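A minimal sketch of the retrieve-then-augment flow; `embed()` here is a hypothetical placeholder that does not capture meaning (a real embedding model would rank the semantically relevant chunk highest), and the chunks and query are illustrative assumptions:

```python
# Hedged RAG retrieval sketch: chunks are embedded once and stored; at query time
# we embed the query, rank chunks by cosine similarity, and build an augmented prompt.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Placeholder: pseudo-random unit vector derived from the text.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

chunks = [
    "The company's 2022 revenue was $10M.",
    "Our office is located in Hanoi.",
    "The refund policy allows returns within 30 days.",
]
chunk_vectors = np.stack([embed(c) for c in chunks])   # the "vector store"

query = "What is the refund policy?"
scores = chunk_vectors @ embed(query)                  # cosine similarity (unit vectors)
best_chunk = chunks[int(np.argmax(scores))]            # a real embedder would pick the refund chunk

augmented_prompt = f"Answer using this context:\n{best_chunk}\n\nQuestion: {query}"
print(augmented_prompt)   # this augmented prompt is what gets sent to the LLM
```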
#### Interacting with external applications
* Trigger API call (e.g. Python interpreter)
* Perform calculations
* Requirements
* Plan actions (step 1 do X, step 2 do Y, etc.)
* Format outputs (can be a specific structure or SQL query)
* Validate actions: collect information to validate (e.g. verify email, phone number)
### Improving reason and plan with chain-of-thought (CoT)
* LLMs can struggle with complex reasoning problems
* CoT: include step-by-step reasoning with explanations in the prompt examples so the LLM mimics that reasoning (can use one-shot or few-shot examples)
### Program-aided language models (PAL)
* Pair an LLM with a code interpreter
* Basic idea: The LLM will generate completions where reasoning steps are accompanied by computer code.
* Process (similar to LLM-powered applications):

* Example:

* In "answer", each line is python code, write with CoT format
### ReAct
* Combining CoT reasoning and actions in LLMs (e.g. an LLM plus a web-search API)
* Basic structure
* Question: the complex problem requires reasoning and multiple steps
* Thought: a reasoning step that specifies how the model will approach the issue and decide on a course of action
* Action: an external task the model can carry out
* Observation: the result
* The model repeats the last three steps (thought, action, observation) as many times as needed to obtain the final answer (the sketch below shows the prompt format)
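A hedged sketch of what a ReAct-style trace looks like; the question, tool names (`search`, `finish`), and observations are illustrative assumptions, not from the course:

```python
# Hedged example of the ReAct prompt structure (thought -> action -> observation loop).
react_prompt = """\
Question: Which country hosted the most recent Summer Olympics, and what is its capital?
Thought: I need to find the host country of the most recent Summer Olympics.
Action: search[most recent Summer Olympics host country]
Observation: The most recent Summer Olympics were hosted by France.
Thought: Now I need the capital of France.
Action: search[capital of France]
Observation: The capital of France is Paris.
Thought: I now know the final answer.
Action: finish[France; Paris]
"""
print(react_prompt)
```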
* Build up the ReAct prompt

### Architectures
