---
tags: paper-reading
---

# 📃 Finetuned Language Models are Zero-Shot Learners (FLAN)

Wei, Jason et al. from Google Research

Published as a conference paper at ICLR

---

- Github: https://github.com/google-research/flan
- Arxiv: https://arxiv.org/abs/2109.01652

---

## Intro

The motivation of **instruction tuning** is to improve the ability of language models to respond to NLP instructions.

![](https://i.imgur.com/Lr2haXG.png)

![](https://i.imgur.com/pIlq0hK.png)

Transform existing datasets into instruction format.
- Aggregating 62 text datasets
    - NLI, Sentiment, Closed-book QA, Reading Comprehension, ...
    - ==Sentiment: IMDB, Sent140, SST-2, Yelp==

Try multiple templates on the same task (all natural-language templates)
- discrete templates (can never be exhaustively tried)
- human understandable

## FLAN: INSTRUCTION TUNING IMPROVES ZERO-SHOT LEARNING

### 2.3 Classification Task with Options

![](https://i.imgur.com/cQxAGCb.png)

Template used:

:::success
[text] [OPTIONS] [option1][option2]...
:::

### 2.4 Training Details

#### Architecture: LaMDA-PT

- left-to-right, decoder-only transformer
- 137B parameters, pretrained on 2.49T BPE tokens (32K vocab built with the SentencePiece library)
- about 90% English, 10% non-English web text

#### Instruction-tuning Procedure

- FLAN: the instruction-tuned version of LaMDA-PT
- Randomly sample from the 62 datasets; each dataset is capped at 30K training examples to balance dataset sizes.
- All models are fine-tuned for 30K gradient steps (lr = 3e-5).
- The *packing* trick combines multiple training examples into a single sequence (presumably to speed up training).
- The datasets are grouped by task type into 12 task clusters. The model is tuned on some clusters while the others are held out for evaluation. Some clusters are not mutually exclusive: for example, the "reading comprehension with commonsense" cluster overlaps with both the reading comprehension cluster and the commonsense cluster. When "reading comprehension with commonsense" is the evaluation cluster, datasets from neither of those two clusters are used during instruction tuning.
- The paper calls evaluation under this setup zero-shot learning.

![](https://i.imgur.com/F17QKCB.png)

![](https://i.imgur.com/W2ArEV0.png)

### Results & Ablation Studies

1. FLAN performs well on tasks that are easy to describe with a verbal instruction (e.g. NLI, reading comprehension), but it is less effective on tasks that are already formulated as language modeling (sentence completion), such as coreference resolution and commonsense reasoning.
2. Following from the first point: when the downstream task has the same form as the original pretraining objective, instruction tuning may not help much.
3. (4.2) Scaling laws: performance on held-out tasks jumps once the model is larger than 8B parameters. For models of 8B parameters and smaller, however, the instruction-tuned model can be worse than the untuned one. Model size is therefore decisive: instruction tuning only pays off for large models.

    ![](https://i.imgur.com/fGJLdrp.png)

    ![](https://i.imgur.com/pIA0wdU.png)

4. (4.3) Ablation studies on instructions

    To understand the role that instructions play in FLAN, the paper tests several alternative setups. The original FLAN setting uses instructions at both tuning and eval time.
    - No instruction: the model is not told what to do and only sees the input and output. E.g. in a translation task, only the English sentence is given, in the hope that the model outputs the French translation.
        Input: “The dog runs.” Output: “Le chien court.”
    - Dataset name: each input is prefixed with the task and dataset name (the name itself may contain a hint about the task that the model can pick up).
        Input: “[Translation: WMT’14 to French] The dog runs.” Output: “Le chien court.”
    - Instruction:
        Input: “Please translate this sentence to French: ‘The dog runs.’” Output: “Le chien court.”

    ![](https://i.imgur.com/PITLqcL.png =400x)

5. (4.4) Few-shot learning

    Everything above is zero-shot: tune on some task clusters, then run inference on the remaining ones. In the few-shot setting, a few input-output pairs (each pair is called an exemplar) are concatenated in front of the new input, at both training and eval time, so that the model gets to see worked examples with their answers (presumably the exemplar outputs themselves are not scored). Because of the sequence-length limit, the number of exemplars is capped at 16 and the total sequence length is kept under 960 tokens. Note that this is different from the packing trick in 2.4 Training Details: packing combines unrelated training examples for efficiency, whereas here the exemplars are context the model conditions on. Quoting the paper:

    For some input $x$ and output $y$, let $\mathrm{instruct}(x)$ denote the zero-shot instructions. Then, given $k$ few-shot exemplars $(x_i, y_i)_{i=1}^{k}$ and a new input $x$, the instruction format for the few-shot setting is

    $$
    \mathrm{instruct}(x_1) \oplus y_1 \oplus \mathrm{instruct}(x_2) \oplus y_2 \oplus \cdots \oplus \mathrm{instruct}(x_k) \oplus y_k \oplus \mathrm{instruct}(x)
    $$

    where $\oplus$ denotes string concatenation with a delimiter token inserted in between.

6. (4.5) Prompt tuning works better on instruction-tuned models than on untuned models. Instruction-tuning a large model first and then applying it to downstream NLP tasks is the better approach (a better checkpoint). The figure below shows (as one would naturally expect) that the instruction-tuned model indeed performs better on unseen tasks.

    Untuned model: LaMDA-PT
    Instruction model: FLAN, the instruction-tuned LaMDA-PT

    ![](https://i.imgur.com/efTI9qt.png =400x)

## Discussion

The question this paper sets out to answer is "Does finetuning a model on a collection of tasks phrased as instructions improve its performance on unseen tasks?" The model built on this idea (FLAN) and the accompanying experiments show that the answer is yes. On most of the datasets examined in the paper, FLAN even outperforms zero-shot GPT-3. The limitations of the model (and of the training method) are:

1. The assignment of tasks to clusters is somewhat subjective.
2. Only short, typically single-sentence instruction templates are studied; much longer instruction templates could not be explored exhaustively.
3. Individual examples from the datasets may have appeared in the pretraining corpus, which cannot be avoided. The post-hoc analysis (Appendix C) suggests this overlap did not substantially affect the results.
4. At 137B parameters, FLAN is a real computational burden to tune.

## Future Work

Future work on instruction tuning could include gathering/generating even more task clusters for finetuning, cross-lingual experiments, using FLAN to generate data for training downstream classifiers, and using finetuning to improve model behavior with respect to bias and fairness (Solaiman & Dennison, 2021).
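
To make the options template (2.3) and the few-shot format (4.4) above more concrete, here is a minimal Python sketch. It is not code from the FLAN repository. The helper names (`with_options`, `instruct_translate`, `few_shot_prompt`), the `"\n\n"` delimiter, the exact layout of the OPTIONS suffix, and the whitespace-based token count are all assumptions made for illustration; a real setup would count tokens with the model's own tokenizer.

```python
from typing import Callable, List, Sequence, Tuple

DELIMITER = "\n\n"  # assumption: stand-in for the delimiter token


def with_options(text: str, options: Sequence[str]) -> str:
    """Append an OPTIONS suffix listing the output classes (layout is assumed)."""
    listed = "\n".join(f"- {option}" for option in options)
    return f"{text}\nOPTIONS:\n{listed}"


def instruct_translate(x: str) -> str:
    """Hypothetical zero-shot instruction template, taken from the note's example."""
    return f"Please translate this sentence to French: '{x}'"


def few_shot_prompt(
    instruct: Callable[[str], str],
    exemplars: Sequence[Tuple[str, str]],
    new_input: str,
    max_exemplars: int = 16,
    max_tokens: int = 960,
    delimiter: str = DELIMITER,
) -> str:
    """Build instruct(x_1) + y_1 + ... + instruct(x_k) + y_k + instruct(x).

    Exemplars are capped at `max_exemplars`, then dropped from the end until
    the prompt has fewer than `max_tokens` whitespace-separated tokens
    (a rough proxy for a real tokenizer).
    """
    kept: List[Tuple[str, str]] = list(exemplars[:max_exemplars])
    while True:
        parts: List[str] = []
        for x_i, y_i in kept:
            parts.append(instruct(x_i))
            parts.append(y_i)
        parts.append(instruct(new_input))
        prompt = delimiter.join(parts)
        if not kept or len(prompt.split()) < max_tokens:
            return prompt
        kept.pop()  # drop one exemplar and try again


if __name__ == "__main__":
    demo = [("The dog runs.", "Le chien court.")]
    print(few_shot_prompt(instruct_translate, demo, "The cat sleeps."))
    print(with_options("Is this review positive?", ["yes", "no"]))
```

The new input always comes last and has no output attached, so the model's continuation is the answer to that input.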
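
The cluster hold-out rule and the 30K per-dataset cap from 2.4 can be sketched in the same spirit. The cluster labels below are a hypothetical subset chosen for illustration, not the paper's full list of 12, and `tuning_clusters` and `cap_dataset` are made-up helper names. Random subsampling is one plausible way to apply the cap; the note does not say how examples are selected.

```python
import random
from typing import Dict, List, Sequence, Set

# Hypothetical subset of cluster labels, for illustration only.
CLUSTERS: List[str] = [
    "nli",
    "sentiment",
    "closed_book_qa",
    "reading_comprehension",
    "commonsense",
    "read_comp_with_commonsense",
    "translation",
]

# Clusters that must also be held out when the key is the evaluation cluster.
OVERLAPS: Dict[str, Set[str]] = {
    "read_comp_with_commonsense": {"reading_comprehension", "commonsense"},
}


def tuning_clusters(
    all_clusters: Sequence[str],
    eval_cluster: str,
    overlaps: Dict[str, Set[str]] = OVERLAPS,
) -> List[str]:
    """Return the clusters usable for instruction tuning for a given eval cluster."""
    held_out = {eval_cluster} | overlaps.get(eval_cluster, set())
    return [cluster for cluster in all_clusters if cluster not in held_out]


def cap_dataset(examples: Sequence, cap: int = 30_000, seed: int = 0) -> List:
    """Keep at most `cap` examples; subsample at random when the dataset is larger."""
    if len(examples) <= cap:
        return list(examples)
    return random.Random(seed).sample(list(examples), cap)


if __name__ == "__main__":
    print(tuning_clusters(CLUSTERS, "nli"))
    print(tuning_clusters(CLUSTERS, "read_comp_with_commonsense"))
    print(len(cap_dataset(list(range(50_000)))))
```

Evaluating on every cluster this way means one instruction-tuned checkpoint per held-out cluster, which is part of why the procedure is expensive at this model size.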
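
Finally, the three input formats compared in the instruction ablation (4.3) differ only in how the raw input is wrapped. The sketch below reproduces the note's translation examples; the function name `format_input` and the mode strings are assumptions, and the templates are hard-coded for this one task.

```python
def format_input(x: str, mode: str) -> str:
    """Wrap a raw input according to one of the three ablation settings."""
    if mode == "no_instruction":
        # The model only sees the bare input.
        return x
    if mode == "dataset_name":
        # The input is prefixed with the task and dataset name.
        return f"[Translation: WMT'14 to French] {x}"
    if mode == "instruction":
        # FLAN's setting: a natural-language instruction.
        return f"Please translate this sentence to French: '{x}'"
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("no_instruction", "dataset_name", "instruction"):
        print(repr(format_input("The dog runs.", mode)))
```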