# Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
https://arxiv.org/abs/2305.02301
## Background
- Deploying LLMs requires huge computational resources and incurs high inference latency
- Fine-tuning small models requires a large amount of labeled data
- Distilling from large models into small ones requires huge computational resources (and, typically, large amounts of unlabeled data)
## Method
### Step 1. Elicit CoT rationales from the LLM teacher (via few-shot CoT prompting, or use a model already strong at CoT)
Prompt the teacher with a handful of CoT exemplars so that, for each unlabeled input, it returns a natural-language rationale together with its predicted label.
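A minimal sketch of what this extraction step could look like for an NLI-style input. The exemplars, the `call_teacher_llm` helper, and the `Rationale:`/`Label:` output format are illustrative assumptions; the paper uses few-shot CoT prompting of a 540B PaLM teacher.

```python
# Sketch of Step 1: prompt the teacher LLM with few-shot CoT exemplars so that,
# for each unlabeled input, it emits a rationale followed by a label.
# `call_teacher_llm` is a placeholder for whatever API serves the teacher model.

COT_EXEMPLARS = """\
Premise: A man is playing a guitar on stage.
Hypothesis: The man is performing music.
Rationale: Playing a guitar on stage is a form of performing music.
Label: entailment

Premise: A dog sleeps on the couch.
Hypothesis: The dog is running in the park.
Rationale: A dog cannot be sleeping on a couch and running in a park at the same time.
Label: contradiction
"""


def call_teacher_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to the teacher LLM and return its completion."""
    raise NotImplementedError


def extract_rationale_and_label(premise: str, hypothesis: str) -> tuple[str, str]:
    """Few-shot CoT prompting: the completion contains a rationale and a label."""
    prompt = (
        f"{COT_EXEMPLARS}\n"
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Rationale:"
    )
    completion = call_teacher_llm(prompt)
    # Assumed output format: "<rationale>\nLabel: <label>"
    rationale, _, label_part = completion.partition("\nLabel:")
    return rationale.strip(), label_part.strip()
```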

### Step 2. Distillation Step-by-Step: train the student with a multi-task objective
Fine-tune the small model on two tasks that share the same input but use different task prefixes: label prediction (`[label]` prefix) and rationale generation (`[rationale]` prefix), with total loss `L = L_label + λ * L_rationale`. The rationale task is training-time supervision only, so inference cost is unchanged.
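A minimal sketch of this multi-task loss, assuming a Hugging Face `transformers` T5 student. The `[label]`/`[rationale]` task prefixes and the weighted sum of the two losses follow the paper's framework; the λ value, function names, and unbatched loss computation are illustrative assumptions.

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
student = T5ForConditionalGeneration.from_pretrained("t5-base")

RATIONALE_WEIGHT = 1.0  # λ in  L = L_label + λ * L_rationale  (value assumed)


def seq2seq_loss(source: str, target: str) -> torch.Tensor:
    """Standard T5 seq2seq cross-entropy loss for one (source, target) pair."""
    enc = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
    return student(**enc, labels=labels).loss


def distill_step_by_step_loss(input_text: str, label_text: str, rationale_text: str) -> torch.Tensor:
    """One example contributes two losses, distinguished only by the task prefix."""
    # Task 1: predict the label -- the only task run at inference time.
    label_loss = seq2seq_loss("[label] " + input_text, label_text)
    # Task 2: reproduce the teacher's rationale -- training-time supervision only.
    rationale_loss = seq2seq_loss("[rationale] " + input_text, rationale_text)
    return label_loss + RATIONALE_WEIGHT * rationale_loss
```

Because only the `[label]` task is run at test time, the rationale supervision adds no extra cost at inference.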

## Models
Teacher (and LLM baselines)
- PaLM (540B) with few-shot CoT prompting (the teacher used for rationale extraction)
- PINTO Tuning (reported as an additional LLM baseline)

Students
- T5-Base (0.2B)
- T5-Large (0.7B)
- T5-XXL (11B)
## Datasets (all are formulated as classification tasks; a formatting sketch follows the list)
- e-SNLI (549k): SNLI natural language inference with human-written explanations
- ANLI (160k): Adversarial NLI (a harder NLI benchmark)
- CQA (9.5k): CommonsenseQA (multiple-choice commonsense question answering)
- SVAMP (0.7k): arithmetic math word problems; the explanations are the solution equations
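For concreteness, a sketch of how a single e-SNLI-style example could be turned into the two training targets used in Step 2; the field names and string formats are assumptions, not the paper's exact preprocessing.

```python
# Hypothetical e-SNLI-style record: gold label from the dataset, rationale from the teacher LLM.
example = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "The man is performing music.",
    "label": "entailment",
    "rationale": "Playing a guitar on stage is a form of performing music.",
}

input_text = f"premise: {example['premise']} hypothesis: {example['hypothesis']}"

# Two seq2seq training instances per example, distinguished by task prefix.
training_instances = [
    ("[label] " + input_text, example["label"]),          # answer prediction
    ("[rationale] " + input_text, example["rationale"]),  # rationale generation
]
```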
## Experimental results
### Reducing Dataset Size (student fixed to T5-Base, 0.2B)
Distilling Step-by-Step vs. Standard Fine-Tuning
- Reaches or exceeds standard fine-tuning on all four datasets while using only a fraction of the labeled training examples.

Distilling Step-by-Step vs. Standard Distillation
- Reaches or exceeds standard task distillation while using only a fraction of the unlabeled (teacher-labeled) examples.

### Reducing Student Model Size
Distilling Step-by-Step vs. Standard Fine-Tuning
- Outperforms standard fine-tuning at every student size; with larger students it can exceed the few-shot CoT performance of the 540B PaLM (e.g., a 770M T5 outperforms the 540B PaLM on one benchmark using only 80% of the training data).

Distilling Step-by-Step vs. Standard Distillation
- Outperforms standard distillation at every student size; students orders of magnitude smaller than the LLM reach or exceed the LLM baselines on some tasks.

## Summary
1. Outperforms standard fine-tuning (of a small model) and standard distillation (from a large to a small model) while using both less training data and smaller student models.
2. In some cases the small student even outperforms the LLM itself, despite being far smaller and trained on only part of the data.