# LLaMA: Open and Efficient Foundation Language Models


<style>
img {
  display: block;
  margin-left: auto;
  margin-right: auto;
}
</style>

> [Paper link](https://arxiv.org/pdf/2302.13971.pdf) | [Note link](https://zhuanlan.zhihu.com/p/617745693) | [Code link](https://github.com/facebookresearch/llama/tree/llama_v1) | arXiv 2023

## Abstract

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. **They release all their models to the research community.**

## Introduction

Efforts to scale LLMs are based on the assumption that more parameters lead to better performance. However, [recent work](https://hackmd.io/0msrrwJ6QmuiaDyFtF_cPQ) shows that, for a given compute budget, **the best performance is achieved not by the largest models, but by smaller models trained on more data.**

The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than is typically used.

Unlike Chinchilla, PaLM, or GPT-3, they only use publicly available data, making their work compatible with open-sourcing.

In the rest of the paper, they present an overview of the modifications they made to the transformer architecture, as well as their training method.

## Approach

Their method is inspired by the **Chinchilla scaling laws**. They train large transformers on a large quantity of textual data using a standard optimizer.

### Pre-training Data

Their training dataset is a mixture of several sources:

![](https://hackmd.io/_uploads/HJ5QGuFdn.png)

For the most part, they reuse data sources that have been leveraged to train other LLMs, with the restriction of only using data that is publicly available and compatible with open-sourcing.

**Tokenizer.** They tokenize the data with the byte-pair encoding (BPE) algorithm.
They split all numbers into individual digits, and fall back to bytes to decompose unknown UTF-8 characters.

Overall, their entire training dataset contains roughly 1.4T tokens after tokenization. For most of their training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which they perform approximately two epochs.

### Architecture

Their network is **based on the transformer** architecture. They **leverage various improvements that were subsequently proposed and used in different models** such as PaLM.

**Pre-normalization [GPT3].** To improve training stability, they normalize the input of each transformer sub-layer instead of normalizing the output.

**SwiGLU activation function [PaLM].** They replace the ReLU non-linearity with the **SwiGLU activation function.** They use a dimension of $\frac{2}{3} 4d$ instead of $4d$ as in PaLM.

**Rotary Embeddings [GPTNeo].** They remove the absolute positional embeddings and instead add **rotary positional embeddings (RoPE)**.

### Optimizer

They use the AdamW optimizer with hyper-parameters $\beta_1=0.9$, $\beta_2=0.95$. They use a weight decay of $0.1$ and gradient clipping of $1.0$. They use $2{,}000$ warmup steps, and vary the learning rate and batch size with the size of the model.

![](https://hackmd.io/_uploads/ByzPdOKOh.png)

### Efficient implementation

They use an efficient implementation of causal multi-head attention to reduce memory usage and runtime. This is achieved by **not storing the attention weights and not computing the key/query scores that are masked** due to the causal nature of the language modeling task.

They also reduce the amount of activations that are recomputed during the backward pass with checkpointing: they save the activations that are expensive to compute, such as the outputs of linear layers.
This is achieved by manually implementing the backward function for the transformer layers, instead of relying on PyTorch autograd.

## Main results

They consider zero-shot and few-shot tasks, and report results on a total of 20 benchmarks:

* **Zero-shot.** They provide a textual description of the task and a test example. The model either provides an answer using open-ended generation, or ranks the proposed answers.
* **Few-shot.** They provide a few examples of the task (between 1 and 64) and a test example. The model takes this text as input and generates the answer or ranks different options.

### Common Sense Reasoning

![](https://hackmd.io/_uploads/B16nC_tdn.png)

### Closed-book Question Answering

![](https://hackmd.io/_uploads/SJOEJYK_n.png)

![](https://hackmd.io/_uploads/SkLH1YFOh.png)

![](https://hackmd.io/_uploads/r1rL1YKd2.png)

For more experiments, please see the original paper:

- Reading Comprehension
- Mathematical reasoning
- Code generation
- Massive Multitask Language Understanding

### Evolution of performance during training

During training, they tracked the performance of their models on a few question answering and common sense benchmarks.

![](https://hackmd.io/_uploads/HyuQlFYO2.png)

![](https://hackmd.io/_uploads/SyuHltY_n.png)

## Instruction Finetuning

In this section, they show that briefly finetuning on instruction data rapidly leads to improvements on massive multitask language understanding (MMLU).

![](https://hackmd.io/_uploads/Bys9eYF_h.png)

## Related work

- Language models
- Architecture
- Scaling

## Conclusion

Unlike previous studies, they show that it is possible to achieve state-of-the-art performance by training exclusively on publicly available data, without resorting to proprietary datasets. Additionally, they observed that finetuning these models on instructions leads to promising results, and they plan to further investigate this in future work.
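
As a side note on the SwiGLU feed-forward block from the Architecture section, the following is a minimal pure-Python sketch of why a hidden width of $\frac{2}{3} 4d$ is a natural choice: SwiGLU uses three weight matrices instead of two, so shrinking the hidden width by $\frac{2}{3}$ keeps the parameter count close to that of a standard $4d$ feed-forward block. The function names, the bias-free formulation, and the rounding of the width up to a multiple of 256 are assumptions made for illustration and are not stated in this note.

```python
import math


def swiglu_hidden_dim(d_model: int, multiple_of: int = 256) -> int:
    """Hidden width (2/3) * 4d, rounded up to a multiple of `multiple_of`.

    The rounding step is an assumption of this sketch, not part of the note.
    """
    hidden = int(2 * (4 * d_model) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


def silu(x: float) -> float:
    """SiLU (Swish with beta = 1): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """FFN(x) = W_down (SiLU(W_gate x) * (W_up x)), without bias terms."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def ffn_param_counts(d_model: int):
    """Weights in a two-matrix 4d FFN vs. a three-matrix SwiGLU FFN."""
    relu_params = 2 * d_model * (4 * d_model)
    swiglu_params = 3 * d_model * swiglu_hidden_dim(d_model)
    return relu_params, swiglu_params
```

With `d_model = 4096`, `swiglu_hidden_dim` returns 11008, and `ffn_param_counts` shows the two variants differ by less than 1% in weight count.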
