<style>
img {
  display: block;
  margin-left: auto;
  margin-right: auto;
}
</style>

> [Paper link](https://arxiv.org/pdf/2302.13971.pdf) | [Note link](https://zhuanlan.zhihu.com/p/617745693) | [Code link](https://github.com/facebookresearch/llama/tree/llama_v1) | arXiv 2023

## Abstract

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.

**They release all their models to the research community.**

## Introduction

The effort to scale LLMs is based on the assumption that more parameters will lead to better performance. However, [recent work](https://hackmd.io/0msrrwJ6QmuiaDyFtF_cPQ) shows that, for a given compute budget, **the best performances are not achieved by the largest models, but by smaller models trained on more data.**

The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used. Unlike Chinchilla, PaLM, or GPT-3, they only use publicly available data, making their work compatible with open-sourcing.

In the rest of the paper, they present an overview of the modifications they made to the transformer architecture, as well as their training method.

## Approach

Their method is inspired by the **Chinchilla scaling laws**. They train large transformers on a large quantity of textual data using a standard optimizer.

### Pre-training Data

Their training dataset is a mixture of several sources:

![](https://hackmd.io/_uploads/HJ5QGuFdn.png)

For the most part, they reuse data sources that have been leveraged to train other LLMs, with the restriction of only using data that is publicly available and compatible with open sourcing.

**Tokenizer.** They tokenize the data with the byte-pair encoding (BPE) algorithm. They split all numbers into individual digits, and fall back to bytes to decompose unknown UTF-8 characters.

Overall, their entire training dataset contains roughly 1.4T tokens after tokenization. For most of their training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which they perform approximately two epochs.

### Architecture

Their network is **based on the transformer** architecture. They **leverage various improvements that were subsequently proposed and used in different models** such as PaLM.

**Pre-normalization [GPT3].** To improve training stability, they normalize the input of each transformer sub-layer, instead of normalizing the output.

**SwiGLU activation function [PaLM].** They replace the ReLU non-linearity with the **SwiGLU activation function**. They use a dimension of $\frac{2}{3} 4d$ instead of $4d$ as in PaLM.

**Rotary Embeddings [GPTNeo].** They remove the absolute positional embeddings, and instead add **rotary positional embeddings (RoPE)**.
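To make these choices concrete, here is a minimal PyTorch sketch of the pre-normalization layout and the SwiGLU feed-forward block with hidden dimension $\frac{2}{3} 4d$ (the paper uses the RMSNorm normalizing function for pre-normalization). This is a simplified illustration for this note, not the official LLaMA code: the class names and the `Block` wiring are illustrative, the attention sub-layer is omitted, and RoPE would be applied to the query and key vectors inside that omitted sub-layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square normalization used for pre-normalization."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the RMS of the features; no mean subtraction as in LayerNorm.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLU(nn.Module):
    """Feed-forward block with the SwiGLU non-linearity.

    The hidden size is (2/3) * 4d rather than 4d, which keeps the parameter
    count comparable to a standard FFN despite the extra gating projection.
    """

    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * 4 * dim / 3)
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class Block(nn.Module):
    """Pre-norm residual layout (attention sub-layer omitted for brevity)."""

    def __init__(self, dim: int):
        super().__init__()
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize the sub-layer *input*, then add the residual.
        return x + self.ffn(self.ffn_norm(x))


x = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
print(Block(512)(x).shape)    # torch.Size([2, 16, 512])
```

In the full model, the same pre-norm residual pattern wraps the attention sub-layer as well.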
### Optimizer

They use the AdamW optimizer with hyper-parameters $\beta_1=0.9$ and $\beta_2=0.95$. They use a weight decay of $0.1$ and gradient clipping of $1.0$. They use 2,000 warmup steps, and vary the learning rate and batch size with the size of the model.

![](https://hackmd.io/_uploads/ByzPdOKOh.png)

### Efficient implementation

They use an efficient implementation of the causal multi-head attention to reduce memory usage and runtime. This is achieved by **not storing the attention weights and not computing the key/query scores that are masked** due to the causal nature of the language modeling task.

They also reduce the amount of activations that are recomputed during the backward pass by using checkpointing: they save the activations that are expensive to compute, such as the outputs of the linear layers. This is achieved by manually implementing the backward function for the transformer layers, instead of relying on PyTorch autograd. (A rough sketch of these two ideas using modern PyTorch primitives is given at the end of this note.)

## Main results

They consider zero-shot and few-shot tasks, and report results on a total of 20 benchmarks:

* **Zero-shot.** They provide a textual description of the task and a test example. The model either provides an answer using open-ended generation, or ranks the proposed answers.
* **Few-shot.** They provide a few examples of the task (between 1 and 64) and a test example. The model takes this text as input and generates the answer or ranks different options.

### Common Sense Reasoning

![](https://hackmd.io/_uploads/B16nC_tdn.png)

### Closed-book Question Answering

![](https://hackmd.io/_uploads/SJOEJYK_n.png)

![](https://hackmd.io/_uploads/SkLH1YFOh.png)

![](https://hackmd.io/_uploads/r1rL1YKd2.png)

For more experiments, please see the original paper:

- Reading Comprehension
- Mathematical reasoning
- Code generation
- Massive Multitask Language Understanding

### Evolution of performance during training

During training, they tracked the performance of their models on a few question answering and common sense benchmarks.

![](https://hackmd.io/_uploads/HyuQlFYO2.png)

![](https://hackmd.io/_uploads/SyuHltY_n.png)

## Instruction Finetuning

In this section, they show that briefly finetuning on instruction data rapidly leads to improvements on massive multitask language understanding (MMLU).

![](https://hackmd.io/_uploads/Bys9eYF_h.png)

## Related work

- Language models
- Architecture
- Scaling

## Conclusion

Unlike previous studies, they show that it is possible to achieve state-of-the-art performance by training exclusively on publicly available data, without resorting to proprietary datasets. Additionally, they observed that finetuning these models on instructions leads to promising results, and they plan to further investigate this in future work.
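As referenced in the Efficient implementation section, here is a rough sketch of those two ideas using modern PyTorch primitives. The paper itself uses a custom causal multi-head attention kernel (from the xformers library) together with a manually written backward pass that saves the expensive activations; the `scaled_dot_product_attention` and `checkpoint` calls below are only an approximation of that behavior, and the toy shapes and modules are hypothetical.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

# (batch, heads, seq_len, head_dim) -- toy shapes for illustration.
q = k = v = torch.randn(2, 8, 128, 64)

# Fused causal attention: the full (seq_len, seq_len) attention-weight matrix
# is never materialized, and scores masked by causality are skipped in-kernel.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])

# Activation checkpointing: the wrapped module's intermediate activations are
# recomputed during the backward pass instead of being stored in memory.
ffn = torch.nn.Sequential(
    torch.nn.Linear(64, 256),
    torch.nn.SiLU(),
    torch.nn.Linear(256, 64),
)
x = torch.randn(2, 128, 64, requires_grad=True)
y = checkpoint(ffn, x, use_reentrant=False)
y.sum().backward()  # activations inside `ffn` are recomputed here
```

Note that this blanket checkpointing is coarser than the paper's strategy, which keeps the expensive linear-layer outputs in memory and only recomputes the cheaper activations, hence the hand-written backward function.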