<!-- .slide: style="font-size: 24px;" -->

# Large Language Models and Large Dialogue Models (`LLMs` and `LDMs`)
---
<!-- .slide: style="font-size: 28px;" -->
## Overview of the talk:
1. Pre-training paradigms.
2. Shift from masked language modeling to next-word prediction.
3. Model types.
4. Overview of `transformer` architecture.
5. Examples of evaluation benchmarks.
6. Datasets for pre-training.
7. Adaptation and fine-tuning.
8. Instruction following and dialogue-like capabilities.
---
<!-- .slide: style="font-size: 28px;" -->
## 1. Pre-training paradigms:
* `MLM` - Masked Language Modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. This means the model has full access to the tokens on the left and right. Masked language modeling is great for tasks that require a good contextual understanding of an entire sequence. `BERT` is an example of a masked language model.
* `next-word-prediction` - this should be self-explanatory: a model is trained to predict the next word given some sequence of words. `GPT2` and `GPT3` are both good examples of this type of pre-training.
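A minimal sketch of both paradigms with the :hugging_face: `transformers` library (the checkpoints are just common representatives):

```python
# pip install transformers torch
from transformers import pipeline

# MLM: the model attends to the whole sequence and fills in the masked token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Paris is the [MASK] of France.")[0]["token_str"])

# Next-word prediction: the model only sees the left context and continues it.
generate = pipeline("text-generation", model="gpt2")
print(generate("Paris is the capital of", max_new_tokens=5)[0]["generated_text"])
```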
---
<!-- .slide: style="font-size: 28px;" -->
## 2. Paradigm shift
* For quite some time (since `BERT` came out in 2018) `NLP` tasks were usually solved by:
  * taking an `MLM`-pre-trained model as a backbone model
  * adding a very small task-specific head (often just a single linear layer)
  * fine-tuning in a supervised way on a task-specific dataset (see the sketch below)
* The shift in perspective was brought about by the `GPT3` model (2020). The model (or model family, to be precise) exhibited so-called emergent properties (properties not intended by the folks training the models):
  * the larger models were able to solve some NLP tasks without any fine-tuning, by carefully designing `prompts`.
  * the larger models seemed to be able to perform some kind of reasoning
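A minimal sketch of that classic recipe (backbone + tiny head + supervised fine-tuning); the checkpoint, labels and data are placeholders:

```python
# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# MLM-pre-trained backbone plus a freshly initialized classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One supervised fine-tuning step on a toy task-specific example
batch = tokenizer(["this movie was great"], return_tensors="pt")
loss = model(**batch, labels=torch.tensor([1])).loss
loss.backward()  # gradients flow through the head *and* the backbone
```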
----
<!-- .slide: style="font-size: 24px;" -->
### 2.1 Zero-shot and few-shot prompting
Models like `GPT3` were found to be able to solve some NLP tasks simply by carefully designing queries, providing a few examples in the prompt, and so on. `Prompt engineering` was born.
<img src="https://hackmd.io/_uploads/SyJR3yYVn.png" width="800" height="600">
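A toy few-shot prompt in the spirit of the figure above (the translation examples follow the style of the `GPT3` paper; `gpt2` is only a small stand-in here):

```python
from transformers import pipeline

# The task is specified purely through examples placed in the context.
prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

generate = pipeline("text-generation", model="gpt2")  # a much larger model does far better
print(generate(prompt, max_new_tokens=5)[0]["generated_text"])
```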
---
## 3. Model types
<!-- .slide: style="font-size: 28px;" -->
We can roughly categorize `LLMs` into three types:
* ==*Encoder-only*== (**BERT, RoBERTa, etc.**). These language models produce contextual embeddings but cannot be used directly to generate text. These contextual embeddings are generally used for classification tasks.
* ==*Decoder-only*== (**GPT-2, GPT-3, etc.**). These are our standard autoregressive language models, which given a prompt $x_{1:i}$ produce both contextual embeddings and a distribution over the next token $x_{i+1}$ (and, recursively, over the entire completion $x_{i+1:L}$).
* ==*Encoder-decoder*== (**BART, T5, etc.**). These models in some ways get the best of both worlds: they can use bidirectional contextual embeddings for the input $x_{1:L}$ and can generate the output $y_{1:L}$.
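In `transformers` the three families map onto different auto-classes (a rough sketch; the checkpoints are just typical representatives):

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only = AutoModel.from_pretrained("bert-base-uncased")         # contextual embeddings only
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")           # distribution over next tokens
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # encode input, generate output
```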
---
## 4. Overview of architectures.
<!-- .slide: style="font-size: 24px;" -->
* In essence, most NLP models today are based on the `transformer` architecture.
* It is surprisingly simple and elegant: `input` :arrow_right: `word_embedding` :arrow_right: `L` x (`attention_block` + `mlp_block`)
* What may be one of its most important characteristics is this: a `transformer` can easily be scaled to billions of parameters and it is relatively easy to run distributed training
<img src="https://hackmd.io/_uploads/HJHrDxuNh.png" width="400" height="400">
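A minimal (and deliberately simplified) PyTorch sketch of that stack; real implementations add positional embeddings, attention masks, dropout, a final language-modelling head, etc.:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]   # attention_block (with residual connection)
        x = x + self.mlp(self.ln2(x))   # mlp_block (with residual connection)
        return x

class TinyTransformer(nn.Module):
    def __init__(self, vocab_size=50_000, d_model=512, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # word_embedding
        self.blocks = nn.Sequential(*[TransformerBlock(d_model) for _ in range(n_layers)])  # L repeats

    def forward(self, token_ids):
        return self.blocks(self.embed(token_ids))
```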
----
<!-- .slide: style="font-size: 30px;" -->
Different `transformer` models differ mostly in:
* model width (hidden dimension of token representation)
* model depth `L` (number of `attention_block` + `mlp_block` repeats)
* type of positional embedding
* attention mechanism
* tokenization mechanism
> **Note**: All these options matter, but by far the most differentiating factors are:
> * model size
> * training data
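These knobs can be read directly from the model configs (attribute names differ between architectures):

```python
from transformers import AutoConfig

bert = AutoConfig.from_pretrained("bert-base-uncased")
gpt2 = AutoConfig.from_pretrained("gpt2")

print(bert.hidden_size, bert.num_hidden_layers, bert.num_attention_heads)  # width, depth, heads
print(gpt2.n_embd, gpt2.n_layer, gpt2.n_head)                              # same knobs, different names
```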
----
<!-- .slide: style="font-size: 20px;" -->
### 4.1 Model size
Model size has been growing exponentially for some time, but it seems to be reaching some limits ;)
* `BERT` 340M params
* `GPT2` 1.5B params
* `GPT3` 175B params
<img src="https://hackmd.io/_uploads/rJU1M7dE3.png" width="900" height="520">
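A common back-of-the-envelope estimate is roughly `12 * depth * width^2` parameters (ignoring embeddings); it reproduces the published sizes quite well:

```python
def approx_params(n_layers, d_model):
    # ~4*d^2 per layer for attention (Q, K, V, output) + ~8*d^2 for the MLP
    return 12 * n_layers * d_model ** 2

print(f"{approx_params(48, 1600) / 1e9:.1f}B")   # GPT2-XL (48 layers, width 1600):  ~1.5B
print(f"{approx_params(96, 12288) / 1e9:.0f}B")  # GPT3    (96 layers, width 12288): ~175B
```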
---
## 5. Overview of evaluation benchmarks.
<!-- .slide: style="font-size: 24px;" -->
There is a huge number of evaluation benchmarks. Some of them are pretty standard and model comparison is not that difficult. However, with the advent of models like `GPT3` (emergent properties, prompting, one-shot and few-shot evaluations) comparing models has become much more difficult.
To give you a taste of what kind of evaluations are performed, have a look at these:
* `LAMBADA` (predict the last word of a sentence)
* `HellaSwag` (choose the most appropriate completion for a sentence from a list of choices)
* `TriviaQA` (closed book question answering)
* `WebQuestions` (closed book question answering)
* `Massive Multitask Language Understanding` (`MMLU`): 57 multiple-choice tasks spanning mathematics, US history, computer science, law, etc.
* `TruthfulQA`: question-answering dataset of questions that humans might answer falsely due to misconceptions.
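Many of these benchmarks can be pulled in via the :hugging_face: `datasets` library; a sketch for `HellaSwag` (field names follow the Hub version of the dataset and may differ in other mirrors):

```python
from datasets import load_dataset

hellaswag = load_dataset("hellaswag", split="validation")
example = hellaswag[0]
print(example["ctx"])      # the context to complete
print(example["endings"])  # the candidate completions
print(example["label"])    # index of the correct completion
```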
---
<!-- .slide: style="font-size: 24px;" -->
## 6. Overview of datasets for pretraining.
* The pre-training phase of `LLMs` requires a massive amount of raw text. It is safe to say that most of the recent `LLMs` have seen, e.g., all of open-access literature, huge chunks of news articles, the whole of Wikipedia, Reddit, Stack Exchange and so on.
* Additionally, many of the newer models have also been pre-trained on heaps of open-source code repositories.
> **Note**: Data cleaning and curation are extremely important and include:
> * removing poor quality text
> * deduplication (removing copies of the same text)
> * removing some sensitive and offensive data
* With time the amount of training data has been growing exponentially, and at this point one can encounter training procedures that remove Wikipedia from the training data in order to use it for evaluation!
* We are, so to speak, running out of tokens to train!
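As an illustration, exact deduplication can be as simple as hashing normalized documents (real pipelines add fuzzy matching, e.g. MinHash, on top of this):

```python
import hashlib

def deduplicate(documents):
    seen, unique = set(), []
    for doc in documents:
        # Hash a whitespace/case-normalized version so trivial copies collide.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

print(len(deduplicate(["Same text.", "same   text.", "Different text."])))  # 2
```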
---
## 7. Adaptation and fine-tuning
There are various approaches to fine-tuning, with some of them taking efficiency and hardware limitations into consideration.
----
<!-- .slide: style="font-size: 24px;" -->
### 7.1 Probing
Probing just adds one or two additional layers and trains only these, treating the chosen `LLM` as a black box:
`inputs` :arrow_right: `LLM` (frozen) :arrow_right: `head` (trainable) :arrow_right: `output`
<img src="https://stanford-cs324.github.io/winter2022/lectures/images/adaptation_CLS.png" width="400" height="300">
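A minimal PyTorch sketch of probing: the backbone is frozen and only the small head gets gradients (checkpoint and head size are illustrative):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backbone = AutoModel.from_pretrained("bert-base-uncased")
for p in backbone.parameters():
    p.requires_grad = False                       # the LLM stays frozen

head = nn.Linear(backbone.config.hidden_size, 2)  # the only trainable part

batch = tokenizer(["probing is cheap"], return_tensors="pt")
with torch.no_grad():
    features = backbone(**batch).last_hidden_state[:, 0]  # [CLS] token embedding
logits = head(features)
```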
----
<!-- .slide: style="font-size: 28px;" -->
### 7.2 Fine-tuning
* Fine-tuning uses the language model parameters $\theta_{LM}$ as initialization for optimization
* The family of optimized parameters contains all `LM` parameters and task-specific prediction head parameters
* Fine-tuning usually uses a **smaller learning rate** (at least one order of magnitude) than during pre-training and is much shorter than pre-training
* Fine-tuning requires storing a large language model specialized for every downstream task, which can be expensive
* However, fine-tuning optimizes over a larger family of models (i.e., very expressive), and usually has better performance than probing
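In practice this is often just a `Trainer` run with a small learning rate; the hyperparameters below are typical values, not a recommendation:

```python
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="finetuned-model",
    learning_rate=2e-5,              # well below typical pre-training learning rates
    num_train_epochs=3,              # much shorter than pre-training
    per_device_train_batch_size=16,
)
# trainer = Trainer(model=model, args=args, train_dataset=task_dataset)  # model/dataset defined elsewhere
# trainer.train()
```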
----
<!-- .slide: style="font-size: 24px;" -->
### 7.3 Lightweight Fine-tuning
Lightweight fine-tuning aims to have the expressivity of full fine-tuning while not requiring us to store the full language model for every task.
* **Prompt tuning** :arrow_right: soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples
* **Prefix tuning** :arrow_right: for `k` positions prepended to the input, concatenate additional learnable weights for keys and values at every attention layer. Unlike prompt tuning, which only learns input vectors
* **LoRA** :arrow_right: low-rank matrix decomposition approach to fine-tuning with much lower memory requirements than standard fine-tuning
<img src="https://hackmd.io/_uploads/ryLKWVWV3.png" width="150" height="150">
There is a `Hugging Face`:hugging_face: library called [`peft`](https://github.com/huggingface/peft) that incorporates many of these approaches.
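A minimal `peft` sketch for `LoRA` (rank, alpha and the base checkpoint are illustrative choices):

```python
# pip install peft transformers torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the parameters is trainable
```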
---
<!-- .slide: style="font-size: 24px;" -->
## 8. Instruction following and dialogue-like capabilities
* In general, models trained in the next-word-prediction paradigm:
  * can be used with carefully designed prompts to extract the type of answer we expect/want
  * are not very good at following instructions
* In order to overcome that, the authors of [InstructGPT](https://arxiv.org/abs/2203.02155) introduced a way to fine-tune pre-trained `LLMs` so that they give more natural and coherent answers. The process they developed for such fine-tuning is known as `reinforcement learning from human feedback`. In the original implementation a `GPT3`-like model was fine-tuned on ==13k== human-labelled text examples. This resulted in a huge increase in perceived (and objectively measured) quality, even for much smaller models.
> Note: `ChatGPT` was trained on much more data than the initial ==13k== examples
----
### 8.1 Reinforcement Learning from Human Feedback (RLHF)
<!-- .slide: style="font-size: 24px;" -->
Based on :hugging_face: [blogpost](https://huggingface.co/blog/rlhf)
`RLHF` can be broken down into:
#### 1. Pre-training a `LLM` or taking a pre-trained one.
#### 2. Gathering data and training a reward model `RM`
#### 3. Fine-tuning the `LLM` with reinforcement learning using `RM`
----
<!-- .slide: style="font-size: 24px;" -->
#### 2. Gathering data and training a reward model
<img src="https://hackmd.io/_uploads/H1NMXOYNh.png" width="250" height="200">
* The underlying goal is to get a `reward model` that takes in a sequence of text, and returns a scalar reward which should numerically represent the human preference.
* A `reward model` (`RM`) can be either another fine-tuned language model or a language model trained from scratch on the preference data.
* The training of the `reward model` goes as follows:
  * The training dataset of prompt-generation pairs for the `reward model` is generated by sampling a set of prompts from a predefined dataset.
  * The prompts are passed through the initial language model `LLM` to generate new text.
  * Human annotators are used to rank the generated text outputs from the `LLM`.
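The `RM` is typically trained with a pairwise ranking loss on those human rankings; a stripped-down sketch with placeholder scalar rewards:

```python
import torch
import torch.nn.functional as F

def reward_loss(reward_chosen, reward_rejected):
    # Pairwise ranking loss: push the reward of the preferred completion above the other one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Placeholder rewards for a batch of (preferred, rejected) completions
r_chosen, r_rejected = torch.tensor([1.2, 0.3]), torch.tensor([0.1, 0.5])
print(reward_loss(r_chosen, r_rejected))
```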
----
<!-- .slide: style="font-size: 24px;" -->
#### 3. Fine-tuning the `LLM` with reinforcement learning
* Given a prompt, `x`, from the dataset, two texts, `y1`, `y2`, are generated – one from the initial language model and one from the current iteration of the fine-tuned policy.
* The text from the current policy is passed to the reward model `RM`, which returns a scalar notion of “preferability”, $r_\theta$
* This text is compared to the text from the initial model to compute a penalty on the difference between them (in practice a `KL` divergence between the two models' output distributions).
* The details can also be found [here](https://wandb.ai/ayush-thakur/RLHF/reports/Understanding-Reinforcement-Learning-from-Human-Feedback-RLHF-Part-1--VmlldzoyODk5MTIx). It takes some knowledge of reinforcement learning to understand them!
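Schematically, the reward optimized by RL combines the `RM` score with the `KL` penalty that keeps the policy close to the initial model (a sketch; `beta` and the log-probabilities are placeholders):

```python
import torch

def rl_reward(rm_score, logprobs_policy, logprobs_initial, beta=0.02):
    # Penalize the policy for drifting too far from the initial model's token distribution.
    kl_penalty = (logprobs_policy - logprobs_initial).sum()
    return rm_score - beta * kl_penalty

print(rl_reward(torch.tensor(1.5), torch.tensor([-0.2, -1.1]), torch.tensor([-0.3, -0.9])))
```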
----
<!-- .slide: style="font-size: 20px;" -->
### 8.2 Smaller models also benefit immensely from instruction fine-tuning
* Recently (from March 2023 until today) there has been a huge effort in the open-source community to bring `ChatGPT`-like features to smaller, open-source models:
  - [`Koala`](https://bair.berkeley.edu/blog/2023/04/03/koala/) -> (BAIR) April 2023
  - [`Dolly`](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm) -> (databricks) April 2023 :arrow_right: :airplane: commercial usage allowed
  - [`MPT-7B`](https://www.mosaicml.com/blog/mpt-7b) -> (MosaicML) May 2023 :arrow_right: :airplane: commercial usage allowed
* These models were created by taking a relatively small `LLM` (5-10B parameters) pre-trained with next-word prediction and then instruction fine-tuning it on:
  * publicly available data
  * data generated by the organization doing the training
* Don't expect them to be as good as `ChatGPT` yet (although on many benchmarks there is little difference), but in many real-life scenarios they are viable alternatives.
> Note: I have actually managed to run `dolly-v2-3b` (3B parameter model) on my laptop on CPU :hugging_face: [link](https://huggingface.co/databricks/dolly-v2-3b)
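If you want to try it yourself, the usage is roughly as below (check the model card for the exact, up-to-date snippet; `device_map="auto"` needs `accelerate`):

```python
# pip install transformers accelerate torch
import torch
from transformers import pipeline

generate = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.bfloat16,
                    trust_remote_code=True, device_map="auto")
print(generate("Explain the difference between nuclear fission and fusion."))
```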
---
## 9. Practical Guide for `NLP` Tasks
<img src="https://raw.githubusercontent.com/Mooler0410/LLMsPracticalGuide/main/imgs/decision.png" width="800" height="350">
{"metaMigratedAt":"2023-06-18T03:20:10.929Z","metaMigratedFrom":"YAML","title":"LLMs amd LDMs","breaks":true,"description":"View the slide with \"Slide Mode\".","slideOptions":"{\"theme\":\"moon\",\"font-size\":\"24px\",\"transition\":\"fade\",\"center\":false}","contributors":"[{\"id\":\"cae2e06d-04aa-446d-82aa-5f9965827a9c\",\"add\":15957,\"del\":2833}]"}