# A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models
###### tags: `graduated` `vqa`

## Abstract
0. prompt design significantly affects zero-shot & only marginally affects few-shot performance
1. pre-train a Seq2Seq Transformer
2. Pre-training objectives: PrefixLM & MaskedLM
- PrefixLM: good for captioning
- MaskedLM: good for VQA

## Introduction
- answer 3 questions
1. How does prompt design affect zero/few-shot learning on new tasks?
2. Does prompt design still matter given larger training?
3. How do different pre-training objectives affect zero/few-shot learning?
- investigated by comparing
1. hand-crafted prompts
2. noisy prompts
- contributions
1. MaskedLM helps few-shot VQA
2. PrefixLM helps captioning
3. with **larger training data**, the model learns from `noisy prompts` as quickly as from `hand-crafted prompts`
## 3. Analysis Setup
### 3.1 Problem Formulation
| notation | description |
| :-: | - |
| $\mathcal L$ | VL model |
| $\mathcal D_{train/dev/test}$ | train/dev/test dataset |
- in the few-shot setting, $|\mathcal D_{train}| = |\mathcal D_{dev}| = 16$
### 3.2 Analysis Questions
1. How does prompt design affect zero/few-shot learning on new tasks?
- compare different prompts: hand-crafted and noisy prompts
2. Does prompt design still matter given larger training data?
- train with different sizes of training data
3. How do different pre-training objectives affect zero/few-shot performance?
- objectives with PrefixLM, MaskedLM
### 3.3 Downstream Tasks and Datasets
Tasks: VQA, Captioning, and Categorical learning
1. VQA: VQAv2, OKVQA, GQA
2. Captioning: NoCaps, Flickr30k (29k train / 1,014 dev / 1k test)
3. Categorical: miniImageNet(meta learning)
### 3.4 Evaluation Metrics
- few-shot
- sample 5 different train/dev splits and report the average score
- train for 200 epochs and choose the best checkpoint on dev
- Metrics
- VQA and Categorical: accuracy
- Captioning: CIDEr, SPICE
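
A small sketch of this few-shot protocol (sample 5 train/dev splits of 16 examples, train 200 epochs each, keep the best dev checkpoint, average the test scores). `train_and_eval` is a hypothetical helper, not from the paper:

```python
import random
import statistics

def few_shot_eval(train_pool, dev_pool, test_set, k=16, seeds=(0, 1, 2, 3, 4)):
    """Average test score over 5 randomly sampled few-shot splits."""
    scores = []
    for seed in seeds:
        rng = random.Random(seed)
        d_train = rng.sample(train_pool, k)   # |D_train| = 16
        d_dev = rng.sample(dev_pool, k)       # |D_dev| = 16
        # hypothetical helper: fine-tunes for 200 epochs and returns the
        # test metric of the checkpoint with the best dev score
        scores.append(train_and_eval(d_train, d_dev, test_set, epochs=200))
    return statistics.mean(scores)
```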
### 3.5 Baseline & Upper Bound
- VQA
- Baseline
1. Frozen
2. PICa
- Upper bound
1. UNITER-large: VQAv2
2. Oscar: GQA
3. VL-T5
- Captioning
- Baseline
1. SimVLM
- Upper bound
1. NoCaps: SimVLM, VinVL
2. Flickr30k: Unified VLP
- Categorical
- Baseline
1. Frozen (designed for few-shot)
2. AFHN (for meta-learning)
## 4. Method
### 4.1 Encoder-Decoder VLM
- Transformer
- minimizing the negative log-likelihood $L_\theta = -\sum_{i=1}^{|y|} \log P_\theta(y_i \mid y_{<i}, x, v)$
- $\theta$: model
- $y$: target text
- $x$: input text
- $v$: input image
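
A minimal PyTorch-style sketch of this objective, assuming a generic encoder-decoder `model` that returns per-token logits under teacher forcing (names are illustrative, not the paper's code):

```python
import torch.nn.functional as F

def nll_loss(model, input_ids, visual_feats, target_ids, pad_id=0):
    """L_theta = -sum_i log P_theta(y_i | y_<i, x, v), averaged over non-pad tokens."""
    logits = model(input_ids=input_ids,
                   visual_feats=visual_feats,
                   decoder_input_ids=target_ids)           # (B, |y|, V)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)      # predict y_i from y_<i
    targets = target_ids[:, 1:]                            # shift targets by one
    token_nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = (targets != pad_id).float()                     # ignore padding
    return (token_nll * mask).sum() / mask.sum().clamp(min=1)
```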
### 4.2 Pre-training Objectives
- Prefix LM
- given an image and caption
- randomly split into 2 components
- 1st component as input, 2nd component as output label
- Masked LM
- randomly mask 15% of the words
- replace the masked spans with sentinel tokens (e.g. <text_1>) to form the input
- decoder generates the masked words as the target text (see the sketch below)
- Data
- VG
- MS COCO
- 9.18M image-text pairs and 180K distinct images
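
A rough sketch of how one image-caption pair could be turned into (input, target) text for each objective, following the description above; the 15% masking and `<text_k>` sentinels follow T5 conventions, and the exact span handling in the paper may differ:

```python
import random

def prefix_lm_example(caption_tokens):
    """PrefixLM: random split point; the prefix is the input, the rest is the target."""
    split = random.randint(1, len(caption_tokens) - 1)
    return caption_tokens[:split], caption_tokens[split:]

def masked_lm_example(caption_tokens, mask_ratio=0.15):
    """MaskedLM: replace ~15% of tokens with sentinels; the decoder recovers them."""
    n_mask = max(1, int(len(caption_tokens) * mask_ratio))
    masked = set(random.sample(range(len(caption_tokens)), n_mask))
    inp, tgt, k = [], [], 0
    for i, tok in enumerate(caption_tokens):
        if i in masked:
            inp.append(f"<text_{k}>")        # sentinel in the input
            tgt += [f"<text_{k}>", tok]      # sentinel + original word in the target
            k += 1
        else:
            inp.append(tok)
    return inp, tgt
```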
## 5. Low-resource Adaptation
### 5.1.1 VQA
- Hand-crafted prompts: 3 templates
1. input prompt:
**“question: [Q] answer: <text_1>”**
- the sentinel token signals where the model should generate the answer
2. target prompt:
- **"[A]"** only answer
- **"<text_1> [A]"** answer with a sentinel token
- Noisy prompts
1. irrelevant prompt
- a prompt unrelated to the task
2. noisy tokens
- randomly replace the prompt tokens with tokens sampled from the T5 vocabulary
3. random sentences
- randomly sampled (mismatched) captions from MS COCO
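
A hedged sketch of how the VQA templates above could be instantiated; function names are illustrative, and the noisy-token variant is one reading of replacing the prompt words with random T5-vocabulary tokens:

```python
import random

def vqa_hand_crafted(question, answer, sentinel_in_target=True):
    """Hand-crafted prompts: the sentinel token marks where the answer goes."""
    input_prompt = f"question: {question} answer: <text_1>"
    target_prompt = f"<text_1> {answer}" if sentinel_in_target else answer
    return input_prompt, target_prompt

def vqa_noisy_tokens(question, t5_vocab):
    """Noisy-token prompt: hand-crafted words are swapped for random vocab tokens."""
    w1, w2 = random.choices(t5_vocab, k=2)
    return f"{w1} {question} {w2} <text_1>"
```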
### 5.1.2 Captioning
- Hand-crafted prompts:
1. input prompt (3 prompts):
- **"a picture of"**
- **"a photo of"**
- **"an image of"**
- the three prompts show **different performance** in zero-shot and few-shot settings
2. target prompt:
- original caption w/o prompt
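
For completeness, a trivial sketch of the captioning format under these templates (one of the three prefixes as input, the raw caption as target):

```python
CAPTION_PROMPTS = ["a picture of", "a photo of", "an image of"]

def caption_example(caption, prompt="an image of"):
    """Input is only the prefix prompt; the target is the caption with no prompt."""
    return prompt, caption
```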
### 5.1.3 MiniImageNet
- Hand-crafted prompts:
1. input prompt:
- **"This is <text_1>"**
2. target prompt:
- **"<text_1> [A]"**
- **prompts are helpful in categorical learning**
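
A minimal sketch of casting miniImageNet classification as open-ended generation with these prompts (the class name plays the role of the answer [A]; illustrative only):

```python
def miniimagenet_example(class_name):
    """Classification as generation: the model must produce the class name."""
    input_prompt = "This is <text_1>"
    target_prompt = f"<text_1> {class_name}"
    return input_prompt, target_prompt
```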
## 6. Experiments
### 6.1 Experiment details
- Pre-training
| hparam | value |
| - | - |
| batch_size(base) | 1280 |
| batch_size(large) | 800 |
| epochs | 30 |
| lr | 1e-4 |
| linear warmup | 5% |
- few-shot
| hparam | value |
| - | - |
| batch_size(base) | 1280 |
| epochs | 200 |
| lr | 5e-5 |
| linear warmup | 5% |
| **VQA** | |
| input prompt | "question: [Q] answer: <text_1>" |
| target prompt | "<text_1> [A]" |
| train set size | 16 |
| dev set size | 16 |
| **Captioning** | |
| input prompt | "an image of" |
| target prompt | None (original caption) |
| train set size | 16 |
| dev set size | 16 |
| **MiniImageNet** | |
| input prompt | "This is <text_1>" |
| target prompt | "<text_1> [A]" |
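
The few-shot settings above, gathered into a single hedged config sketch (field names mirror the tables and are not the paper's code):

```python
from dataclasses import dataclass

@dataclass
class FewShotConfig:
    batch_size: int = 1280
    epochs: int = 200
    lr: float = 5e-5
    warmup_ratio: float = 0.05                 # 5% linear warmup
    n_train: int = 16                          # |D_train|
    n_dev: int = 16                            # |D_dev|
    # per-task prompts
    vqa_input: str = "question: [Q] answer: <text_1>"
    vqa_target: str = "<text_1> [A]"
    caption_input: str = "an image of"
    caption_target: str = ""                   # original caption, no prompt
    miniimagenet_input: str = "This is <text_1>"
    miniimagenet_target: str = "<text_1> [A]"
```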
### 6.2 Performance on zero-shot

### 6.3 Performance on few-shot
