<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/) | [Note link](https://zhuanlan.zhihu.com/p/644440986) | [Code link](https://github.com/facebookresearch/llama) | Meta AI 2023
:::success
**Thoughts**
- A cumulative 3.3M GPU-hours of computation was performed on A100-80GB hardware (TDP of 350W or 400W).
- A larger context length yields better results than a shorter one.
:::
## Abstract
In this work, they develop and release **Llama 2**, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Their fine-tuned LLMs, called **Llama 2-Chat**, are optimized for dialogue use cases.
## Introduction
**Llama 2** is an updated version of Llama 1, trained on a new mix of publicly available data. They also increased the size of the pretraining corpus by 40%, doubled the context length of the model, and **adopted grouped-query attention**. They are releasing variants of Llama 2 with 7B, 13B, and 70B parameters. They have also trained 34B variants, which they report on in this paper but are not releasing.
**Llama 2-Chat** is a fine-tuned version of Llama 2 that is optimized for dialogue use cases. They release variants of this model with 7B, 13B, and 70B parameters as well.

## Pretraining
The primary architectural differences from Llama 1 include **increased context length** and **[grouped-query attention (GQA)](https://arxiv.org/pdf/2305.13245.pdf)**.
**Context Length**
They expand the context window for Llama 2 from 2048 tokens to 4096 tokens.

**Grouped-Query Attention**

For larger models, where KV cache size becomes a bottleneck, key and value projections can be shared across multiple heads without much degradation of performance.
Either the original multi-query format with a single KV projection (MQA) or a grouped-query attention variant with 8 KV projections (GQA) can be used.
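Below is a minimal sketch of grouped-query attention in PyTorch, just to make the sharing of KV projections concrete. The head counts and dimensions are toy values, not Llama 2's actual configuration, and the projection/caching details are simplified.

```python
# Minimal grouped-query attention (GQA) sketch, assuming PyTorch.
# Toy sizes for illustration; not the Llama 2 configuration.
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    """x: (batch, seq, dim). n_heads must be a multiple of n_kv_heads."""
    bsz, seqlen, dim = x.shape
    head_dim = dim // n_heads

    # Project to n_heads query heads but only n_kv_heads key/value heads.
    q = (x @ wq).view(bsz, seqlen, n_heads, head_dim)
    k = (x @ wk).view(bsz, seqlen, n_kv_heads, head_dim)
    v = (x @ wv).view(bsz, seqlen, n_kv_heads, head_dim)

    # Each group of n_heads // n_kv_heads query heads shares one KV head.
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=2)  # (bsz, seqlen, n_heads, head_dim)
    v = v.repeat_interleave(group, dim=2)

    q, k, v = (t.transpose(1, 2) for t in (q, k, v))      # (bsz, n_heads, seq, head_dim)
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5  # scaled dot-product attention
    attn = F.softmax(scores, dim=-1)
    return (attn @ v).transpose(1, 2).reshape(bsz, seqlen, dim)

# Example: 8 query heads sharing 2 KV heads (a 4:1 grouping).
dim, n_heads, n_kv_heads = 128, 8, 2
x = torch.randn(1, 16, dim)
wq = torch.randn(dim, dim)
wk = torch.randn(dim, n_kv_heads * (dim // n_heads))
wv = torch.randn(dim, n_kv_heads * (dim // n_heads))
print(grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads).shape)  # (1, 16, 128)
```

Because only `n_kv_heads` key/value heads are stored, the KV cache shrinks by a factor of `n_heads / n_kv_heads`, which is what relieves the bottleneck for the larger models.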


**Llama 2 Pretrained Model Evaluation**
- Code
- Commonsense Reasoning
- World Knowledge
- Reading Comprehension
- MATH
- Popular Aggregated Benchmarks


## Fine-tuning
**Reward Modeling**
To train the reward model, they convert their collected pairwise human preference data into a binary ranking label format (i.e., chosen & rejected) and enforce the chosen response to have a higher score than its counterpart.
$$
\tag{1} \mathcal{L}_{\text{ranking}} = -\log(\sigma(r_\theta(x,y_c)-r_\theta(x,y_r)))
$$
where $r_\theta(x,y)$ is the scalar score output for prompt $x$ and completion $y$ with model weights $\theta$. $y_c$ is the preferred response that annotators choose and $y_r$ is the rejected counterpart.
They further want the reward model to assign more discrepant scores to generations that differ more, so they add a margin term to the loss:
$$
\tag{2} \mathcal{L}_{\text{ranking}} = -\log(\sigma(r_\theta(x,y_c)-r_\theta(x,y_r) - m(r)))
$$
where the margin $m(r)$ is a discrete function of the preference rating.
Naturally, they use a large margin for pairs with distinct responses, and a smaller one for those with similar responses.
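A minimal sketch of the margin ranking loss in Eq. (2), assuming PyTorch and that a reward model has already produced scalar scores $r_\theta(x, y_c)$ and $r_\theta(x, y_r)$ per pair. The margin values below are illustrative placeholders, not the paper's exact margins.

```python
# Ranking loss with a preference-rating margin m(r), as in Eq. (2).
import torch
import torch.nn.functional as F

def ranking_loss(score_chosen, score_rejected, margin):
    """score_chosen, score_rejected, margin: tensors of shape (batch,)."""
    # -log(sigmoid(z)) == softplus(-z), with z = r(x, y_c) - r(x, y_r) - m(r)
    z = score_chosen - score_rejected - margin
    return F.softplus(-z).mean()

# Illustrative: a larger margin for a pair rated clearly better,
# a zero margin for a pair rated only negligibly better.
score_chosen = torch.tensor([1.2, 0.3])
score_rejected = torch.tensor([0.1, 0.2])
margin = torch.tensor([1.0, 0.0])
print(ranking_loss(score_chosen, score_rejected, margin))
```

Setting `margin = 0` recovers the plain ranking loss of Eq. (1); a positive margin forces the chosen response to beat the rejected one by at least that amount before the loss vanishes.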

**System Message for Multi-Turn Consistency**
They propose Ghost Attention (GAtt), a very simple method inspired by Context Distillation that hacks the fine-tuning data to help the attention focus in a multi-stage process.

Assume we have access to a multi-turn dialogue dataset between two persons (e.g., a user and an assistant), with a list of messages $[u_1, a_1, \dots, u_n, a_n]$, where $u_n$ and $a_n$ correspond to the user and assistant messages for turn $n$.
Then, we define an instruction, $inst$, that should be respected throughout the dialogue, and synthetically concatenate this instruction to all the user messages of the conversation.
Next, we can sample from this synthetic data using the latest RLHF model. We now have a context-dialogue and the sample with which to fine-tune a model, in a process analogous to Rejection Sampling.
Instead of augmenting all context-dialogue turns with the instruction, we can drop it in all but the first turn, but this would lead to a mismatch at training time between the system message, i.e., all the intermediate assistant messages that come before the last turn, and our sample.
To fix this issue, which could hurt the training, we simply set the loss to 0 for all the tokens from the previous turns, including assistant messages.
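A rough sketch of the training-side data construction described above: keep the instruction only on the first user turn and compute the loss only on the final assistant message. The dialogue format, field names, and `compute_loss` flag are assumptions for illustration; the paper works at the token level rather than per turn.

```python
# GAtt-style training example construction (illustrative sketch).
def build_gatt_example(inst, dialogue):
    """dialogue: list of (role, text) turns alternating 'user' / 'assistant'.
    Returns turns annotated with whether they contribute to the training loss."""
    example = []
    first_user_seen = False
    for role, text in dialogue:
        if role == "user" and not first_user_seen:
            # Keep the instruction only on the first user turn at training time.
            text = inst + "\n" + text
            first_user_seen = True
        example.append({"role": role, "text": text, "compute_loss": False})

    # Only the final assistant message (the sample from the latest RLHF model)
    # receives a loss; earlier turns are zeroed out to avoid the mismatch above.
    for turn in reversed(example):
        if turn["role"] == "assistant":
            turn["compute_loss"] = True
            break
    return example

dialogue = [
    ("user", "Hi!"), ("assistant", "Hello! I only answer in haiku."),
    ("user", "Tell me about the sea."), ("assistant", "Waves fold on the shore..."),
]
for turn in build_gatt_example("Always answer with haiku.", dialogue):
    print(turn)
```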
## Conclusion
In this study, they have introduced Llama 2, a new family of pretrained and fine-tuned models with scales of 7 billion to 70 billion parameters.
These models have demonstrated their competitiveness with existing open-source chat models, as well as competency equivalent to some proprietary models on the evaluation sets they examined, although they still lag behind other models such as GPT-4.