# A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models

###### tags: `graduated` `vqa`

![](https://i.imgur.com/bhgLZGM.png)

## Abstract
0. prompt design significantly affects zero-shot performance & only marginally affects few-shot performance
1. pre-train a Seq2Seq Transformer
2. Tasks: PrefixLM & MaskedLM
    - PrefixLM: good for captioning
    - MaskedLM: good for VQA

![](https://i.imgur.com/hb384iq.png)

## Introduction
- answers 3 questions
    1. How does prompt design affect zero/few-shot learning on new tasks?
    2. Does prompt design still matter given larger training data?
    3. How do different pre-training objectives affect zero/few-shot learning?
- by comparing
    1. hand-crafted prompts
    2. noisy prompts
- contributions
    1. MaskedLM helps few-shot VQA
    2. PrefixLM helps captioning
    3. with **larger training data**, the model learns from `noisy prompts` as quickly as from `hand-crafted prompts`

## 3. Analysis Setup

### 3.1 Problem Formulation

| notation | description |
| :-: | - |
| $\mathcal L$ | VL model |
| $\mathcal D_{train/dev/test}$ | train/dev/test dataset |

- in the few-shot setting, $|\mathcal D_{train}| = |\mathcal D_{dev}| = 16$

### 3.2 Analysis Questions
1. How does prompt design affect zero/few-shot performance on new tasks?
    - compare different prompts: hand-crafted and noisy prompts
2. Does prompt design still matter given larger training data?
    - train with different sizes of training data
3. How do different pre-training objectives affect zero/few-shot performance?
    - compare PrefixLM and MaskedLM objectives

### 3.3 Downstream Tasks and Datasets
tasks: VQA, Captioning, and Categorical learning
1. VQA: VQAv2, OK-VQA, GQA
2. Captioning: NoCaps, Flickr30k (29k train / 1,014 dev / 1k test)
3. Categorical: miniImageNet (meta-learning)

### 3.4 Evaluation Metrics
- few-shot
    - sample 5 different train/dev splits and report the average
    - train for 200 epochs and choose the best checkpoint on dev
- Metrics
    - VQA and Categorical: accuracy
    - Captioning: CIDEr, SPICE

### 3.5 Baseline & Upper Bound
- VQA
    - Baselines
        1. Frozen
        2. PICa
    - Upper bounds
        1. UNITER-large: VQAv2
        2. Oscar: GQA
        3. VL-T5
- Captioning
    - Baseline
        1. SimVLM
    - Upper bounds
        1. NoCaps: SimVLM, VinVL
        2. Flickr30k: Unified VLP
- Categorical
    - Baselines
        1. Frozen (designed for few-shot)
        2. AFHN (for meta-learning)

## 4. Method

### 4.1 Encoder-Decoder VLM
- Transformer
- minimize the negative log-likelihood $L_\theta = - \sum_{i=1}^{|y|} \log P_\theta(y_i \mid y_{<i}, x, v)$
    - $\theta$: model parameters
    - $y$: target text
    - $x$: input text
    - $v$: input image

### 4.2 Pre-training Objectives
- Prefix LM
    - given an image and its caption
    - randomly split the caption into 2 parts
    - 1st part as input, 2nd part as output label
- Masked LM
    - randomly mask 15% of the words
    - replace masked spans with sentinel tokens (`<text_1>`, `<text_2>`, ...) to form the input
    - decoder generates the masked target text
- Data
    - Visual Genome (VG)
    - MS COCO
    - 9.18M image-text pairs and 180K distinct images

## 5. Low-resource Adaptation

### 5.1.1 VQA
- Hand-crafted prompts: 3 templates (see the prompt-construction sketch after §5.1.2)
    1. input prompt: **"question: [Q] answer: <text_1>"**
        - the sentinel token cues the model to generate the answer words
    2. target prompt:
        - **"[A]"**: the answer only
        - **"<text_1> [A]"**: the answer preceded by the sentinel token
- Noisy prompts
    1. irrelevant prompt
        - a prompt unrelated to the task
    2. noisy tokens
        - prompt tokens replaced with random tokens from the T5 vocabulary
    3. random sentences
        - random false captions from MS COCO

### 5.1.2 Captioning
- Hand-crafted prompts:
    1. input prompt (3 variants):
        - **"a picture of"**
        - **"a photo of"**
        - **"an image of"**
        - these show **different performance** in zero-shot and few-shot settings
    2. target prompt:
        - the original caption, without any prompt
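
To make these templates concrete, here is a minimal Python sketch of the prompt construction described above. The function names are mine, and the exact way the noisy variants inject random tokens or captions is an illustrative assumption based on these notes, not the paper's code:

```python
import random

def vqa_input_prompt(question: str) -> str:
    """Hand-crafted VQA input prompt "question: [Q] answer: <text_1>";
    the sentinel token cues the decoder to produce the answer span."""
    return f"question: {question} answer: <text_1>"

def vqa_target_prompt(answer: str, with_sentinel: bool = True) -> str:
    """Target prompt: either "<text_1> [A]" or the bare answer "[A]"."""
    return f"<text_1> {answer}" if with_sentinel else answer

def noisy_token_prompt(question: str, vocab: list) -> str:
    """Noisy-token variant: the hand-crafted wording ("question:", "answer:")
    is replaced by tokens drawn at random from the (T5) vocabulary."""
    w1, w2 = random.choices(vocab, k=2)
    return f"{w1} {question} {w2} <text_1>"

def random_sentence_prompt(question: str, coco_captions: list) -> str:
    """Random-sentence variant: a random (false) MS COCO caption stands in
    for the hand-crafted prompt wording."""
    return f"{question} {random.choice(coco_captions)} <text_1>"

# Captioning input prompts compared in the notes above
CAPTION_PROMPTS = ["a picture of", "a photo of", "an image of"]
```

For example, `vqa_input_prompt("what color is the bus?")` yields `question: what color is the bus? answer: <text_1>`, i.e. the same sentinel format the model sees during MaskedLM pre-training.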
### 5.1.3 MiniImageNet
- Hand-crafted prompts:
    1. input prompt:
        - **"This is <text_1>"**
    2. target prompt:
        - **"<text_1> [A]"**
- **prompts are helpful in categorical learning**

## 6. Experiments

### 6.1 Experiment details
- Pre-training

| hparam | value |
| - | - |
| batch_size (base) | 1280 |
| batch_size (large) | 800 |
| epochs | 30 |
| lr | 1e-4 |
| linear warmup | 5% |

- few-shot (a minimal fine-tuning sketch is given at the end of these notes)

| hparam | value |
| - | - |
| batch_size (base) | 1280 |
| epochs | 200 |
| lr | 5e-5 |
| linear warmup | 5% |

| task | setting | value |
| - | - | - |
| VQA | input prompt | "question: [Q] answer: <text_1>" |
| VQA | target prompt | "<text_1> [A]" |
| VQA | train set size | 16 |
| VQA | dev set size | 16 |
| Captioning | input prompt | "an image of" |
| Captioning | target prompt | None |
| Captioning | train set size | 16 |
| Captioning | dev set size | 16 |
| MiniImageNet | input prompt | "This is <text_1>" |
| MiniImageNet | target prompt | "<text_1> [A]" |

### 6.2 Performance on zero-shot
![](https://i.imgur.com/hfaEkeT.png)

### 6.3 Performance on few-shot
![](https://i.imgur.com/zxVpzQK.png)
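
To make the few-shot recipe in §6.1 concrete, below is a minimal PyTorch sketch of the adaptation loop (200 epochs, lr 5e-5, 5% linear warmup, best checkpoint on dev). It assumes a HuggingFace-style encoder-decoder model whose forward call returns an object with a `.loss`, and hypothetical `train_loader` / `dev_loader` that already apply the prompts to the 16 train / 16 dev examples; checkpoint selection is simplified here to dev loss rather than the task metric (accuracy / CIDEr / SPICE):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def finetune_few_shot(model, train_loader, dev_loader,
                      epochs=200, lr=5e-5, warmup_frac=0.05, device="cuda"):
    """Few-shot adaptation with the hyperparameters listed in section 6.1."""
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=lr)
    total_steps = max(1, epochs * len(train_loader))
    warmup_steps = max(1, int(total_steps * warmup_frac))
    # linear warmup over the first 5% of steps, then constant lr
    scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

    best_dev_loss, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        for batch in train_loader:
            # batch holds visual features plus prompted input / target text ids
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        # checkpoint selection on the 16-example dev split
        model.eval()
        with torch.no_grad():
            dev_loss = sum(
                model(**{k: v.to(device) for k, v in b.items()}).loss.item()
                for b in dev_loader
            ) / len(dev_loader)
        if dev_loss < best_dev_loss:
            best_dev_loss = dev_loss
            best_state = {k: v.detach().cpu().clone()
                          for k, v in model.state_dict().items()}

    model.load_state_dict(best_state)
    return model
```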