[Paper Notes] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

---
tags: Paper, LLM
---

Paper link: https://arxiv.org/abs/2305.02301

## Introduction

Large language models (LLMs) usually need a lot of memory and compute to run, so in practice a smaller task-specific model is often trained instead, through finetuning or distilling. However, finetuning and distilling need a large amount of training data to reach performance comparable to an LLM. The authors propose Distilling step-by-step for training task-specific models: the method trains smaller models with less training data, yet their performance can match or even surpass that of the LLM.

![](https://hackmd.io/_uploads/S1_IzDega.png)

## Methods

LLMs can generate rationales, that is, explanations for their own predictions. The core idea of Distilling step-by-step is to use these LLM-generated rationales to train the smaller model. The figure below outlines the whole process: extract labels and rationales, then use them to train the model.

![](https://hackmd.io/_uploads/BJLWeKlx6.png)

### Extracting Rationales

First, few-shot Chain-of-Thought (CoT) prompting is used to extract rationales. Concretely, a prompt template demonstrates how the task should be solved; each example in the template contains an input $x$, a label $y$, and a rationale $r$. Given a new input, the LLM imitates these examples and outputs the corresponding answer together with a rationale.

![](https://hackmd.io/_uploads/Sk1htsGla.png)

### Training Models with Rationales

In general, whether we finetune with human-annotated labels or distill with labels generated by an LLM teacher, a small task-specific model $f$ is trained by minimizing the following label prediction loss:

$$
\mathcal{L}_\text{label} = \frac{1}{N}\sum^N_{i=1} \ell(f(x_i), \hat{y}_i)
$$

where $\ell$ is the cross-entropy loss and $\hat{y}_i$ is the target label (human-annotated or LLM-generated).

Distilling step-by-step defines a new loss function:

$$
\mathcal{L} = \mathcal{L}_\text{label} + \lambda \mathcal{L}_\text{rationale}
$$

In other words, it treats learning as a multi-task problem, where $\lambda$ weights the rationale term and $\mathcal{L}_\text{rationale}$ is the rationale generation loss, defined as

$$
\mathcal{L}_\text{rationale} = \frac{1}{N}\sum^N_{i=1} \ell(f(x_i), \hat{r}_i)
$$

This design lets the model learn to generate a reasonable explanation for its prediction, which in turn guides it toward better predictions.

## Experiments

With the 540B PaLM model as the LLM and T5 models as the downstream task-specific models, the experiments below compare Distilling step-by-step with standard finetuning on human-labeled datasets, and with standard distillation on various unlabeled datasets. Distilling step-by-step outperforms both finetuning and distillation while using less data.

![](https://hackmd.io/_uploads/ByeysaMgp.png)

![](https://hackmd.io/_uploads/BkFJi6Gla.png)

The next experiments fix the training set at 100% of the dataset and vary the size of the T5 model. They compare Distilling step-by-step with standard finetuning on human-labeled datasets, and with standard distillation on unlabeled datasets. Two LLM baselines are included for comparison:

- Few-shot CoT: prompting the 540B PaLM directly with CoT
- PINTO tuning: fine-tuning the LLM for reasoning

The results show that Distilling step-by-step outperforms standard finetuning and distillation at every model size. It can also beat the LLM with a much smaller model.

![](https://hackmd.io/_uploads/SJT_n6Gx6.png)

![](https://hackmd.io/_uploads/Syx53pMxp.png)

The last experiments examine whether Distilling step-by-step can really outperform the LLM while using both less training data and a smaller model. The results of training on human-labeled datasets and on unlabeled datasets are shown below:

![](https://hackmd.io/_uploads/HJ5NrCMeT.png)

![](https://hackmd.io/_uploads/SJBHBCMx6.png)

The results show that Distilling step-by-step can indeed outperform the LLM with less data and a smaller model, whereas standard finetuning and distillation need more data and a larger model to reach performance comparable to the LLM.

## Limitations

Distilling step-by-step still has some limitations:

1. Users must provide demonstrations in order to run few-shot CoT.
2. On more complex tasks, the LLM may not be able to produce good rationales.
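
As a supplement to the loss defined under "Training Models with Rationales", here is a minimal sketch of how $\mathcal{L} = \mathcal{L}_\text{label} + \lambda \mathcal{L}_\text{rationale}$ could be computed. It is not the authors' implementation. The function names are hypothetical, $\ell$ is taken to be the mean negative log-likelihood of the target tokens, and the probabilities are made-up numbers standing in for what a model $f$ would assign to the target label and rationale tokens.

```python
import math

def cross_entropy(token_probs):
    """Mean negative log-likelihood of one target sequence.

    token_probs: probability the model assigned to each target token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def distill_step_by_step_loss(label_probs, rationale_probs, lam=1.0):
    """Return (L, L_label, L_rationale) with L = L_label + lam * L_rationale.

    label_probs / rationale_probs: one list of token probabilities per example.
    """
    n = len(label_probs)
    loss_label = sum(cross_entropy(p) for p in label_probs) / n
    loss_rationale = sum(cross_entropy(p) for p in rationale_probs) / n
    return loss_label + lam * loss_rationale, loss_label, loss_rationale

# Toy batch of N = 2 examples (illustrative values, not real model output).
label_probs = [[0.9], [0.6, 0.8]]
rationale_probs = [[0.5, 0.7, 0.9], [0.4, 0.6]]

total, l_label, l_rat = distill_step_by_step_loss(label_probs, rationale_probs, lam=1.0)
print(f"label={l_label:.4f} rationale={l_rat:.4f} total={total:.4f}")
```

Setting `lam=0.0` recovers the plain label prediction loss, so the rationale term acts purely as an auxiliary training signal on top of standard finetuning or distillation.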
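
The few-shot CoT step described under "Extracting Rationales" can be sketched in the same spirit. The `Q:` / `A:` layout, the "So the answer is" marker, and both function names are assumptions made for illustration; the note does not specify the exact template. The demonstration and the completion below are invented examples, and no real LLM is called.

```python
def build_cot_prompt(demos, new_input):
    """Format (input, rationale, label) demos followed by an unanswered query."""
    parts = []
    for x, r, y in demos:
        parts.append(f"Q: {x}\nA: {r} So the answer is {y}.")
    # The final query ends with "A:" so the LLM continues with rationale + answer.
    parts.append(f"Q: {new_input}\nA:")
    return "\n\n".join(parts)

def split_rationale_and_label(completion, marker="So the answer is"):
    """Split an LLM completion into (rationale, label) at the answer marker."""
    rationale, sep, label = completion.rpartition(marker)
    if not sep:
        # Marker missing: keep the text as a rationale, no label recovered.
        return completion.strip(), None
    return rationale.strip(), label.strip().rstrip(".").strip()

# One invented demonstration (x, r, y) and one new input.
demos = [(
    "Where would you keep milk cold? (a) fridge (b) shelf",
    "Milk spoils at room temperature, and a fridge keeps food cold.",
    "(a) fridge",
)]
prompt = build_cot_prompt(demos, "Where do fish live? (a) desert (b) water")
print(prompt)

# A hypothetical completion, as an LLM might return it for the prompt above.
completion = " Fish breathe through gills and need water. So the answer is (b) water."
rationale, label = split_rationale_and_label(completion)
print(rationale, "|", label)
```

The extracted `(rationale, label)` pairs play the roles of $\hat{r}_i$ and $\hat{y}_i$ when training the smaller model.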