<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://doi.org/10.48550/ARXIV.2210.02441) | [Note link](https://zhuanlan.zhihu.com/p/609728268) | [Code link](https://github.com/HazyResearch/ama_prompting) | ICLR 2023
:::success
**Thoughts**
:::
## Abstract
Large language models (LLMs) can be prompted with demonstrations of how to perform a task, with no additional training. However, designing the prompt for each task requires significant effort.
Their method instead collects multiple effective, yet imperfect, prompts and aggregates them, which can lead to a high-quality prompting strategy.
Ask Me Anything (AMA) first develops an understanding of which prompt formats are effective, then applies these prompts to collect several noisy votes for the input's true label. They find that these prompts can have very different accuracies and complex dependencies, and thus propose to use weak supervision, a procedure for combining noisy predictions, to produce the final predictions.
## Introduction
Much prior effort goes into designing the *perfect prompt* for a task. This work instead considers aggregating the predictions of multiple effective, yet imperfect, prompts to improve prompting performance over a broad set of models and tasks.
Given a task input, each prompt produces a vote for the input’s true label, and these votes are aggregated to produce a final prediction. In pursuit of high-quality prompting via aggregation, they face the following challenges:
1. Effective prompts
2. Scalable collection
3. Prompt aggregation
In this work, they propose ASK ME ANYTHING PROMPTING (AMA), a simple approach that surprisingly enables open-source LLMs with 30x fewer parameters to exceed the few-shot performance of GPT3-175B.
:::info
**Contribution**
1. They identify properties of prompts that improve effectiveness across tasks, model types, and model sizes.
2. They propose a strategy for scalably reformatting task inputs to the effective formats.
3. They propose the use of weak supervision (WS) to reliably aggregate predictions.
:::
## Ask Me Anything Prompting

### Preliminaries
They consider supervised tasks, $(\mathcal{X}, \mathcal{Y})$, where $x \in \mathcal{X}$ is the example and $y \in \mathcal{Y}$ is the output. They also have an unlabeled dataset $\mathcal{D} = \{ x_i \}_{i=1}^n$ for which they wish to predict each $y_i$.
A prompt consists of a prompt template containing:
1. Zero or more in-context task demonstrations
2. The inference example $x$, as shown in Figure 3 of the paper

They use $p : \mathcal{X} \rightarrow \mathcal{Y}$ to denote the prompted LLM, which produces a prediction $\hat{y} = p(x)$. A collection of $m$ prompts is written $\mathbf{P} = [p_1, p_2, \dots, p_m]$, and an aggregator function $\phi : \mathcal{Y}^m \rightarrow \mathcal{Y}$ combines their votes into a final output $\hat{y}$ for each $x$.
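A minimal sketch of these definitions in Python; the names `Prompt`, `apply_prompts`, and `majority_vote` are illustrative, not from the paper's code:

```python
from collections import Counter
from typing import Callable, List

# A prompt p maps an input x to a (noisy) predicted label y-hat.
Prompt = Callable[[str], str]

def apply_prompts(prompts: List[Prompt], x: str) -> List[str]:
    """Collect P(x): one vote per prompt for the true label of x."""
    return [p(x) for p in prompts]

def majority_vote(votes: List[str]) -> str:
    """Simplest possible aggregator phi: Y^m -> Y; AMA replaces this with weak supervision."""
    return Counter(votes).most_common(1)[0][0]
```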

### Effective Prompt Formats
First, they explore what makes a prompt format effective at improving the quality of $\mathbf{P}(x)$.
**Standard prompt formats**
1. Questions that **restrict** the model output to particular tokens
2. **Cloze-questions** which ask the model to fill in the remaining text
3. Traditional (yes-no, Wh) **free-form questions**
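For a sentiment task, the three formats might look roughly as follows (paraphrased illustrations, not the paper's exact prompts):

```python
# Restrictive: the model must answer with one of a few allowed tokens.
restrictive = ('Is the following review positive? Answer "Yes" or "No".\n'
               'Review: I loved this movie.\nAnswer:')

# Cloze: the model fills in the remaining text.
cloze = 'Review: I loved this movie. The sentiment of this review is'

# Free-form QA: a traditional (yes-no or wh-) question.
free_form = ('Review: I loved this movie.\n'
             'Question: What is the sentiment of this review?\nAnswer:')
```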
They compare these three prompt formats and make the following observations:
- Open-ended prompts appear to outperform restrictive prompts.
- However, using open-ended questions instead of restrictive prompts can make it harder to map the open-ended answers onto valid output classes.
**Why is the QA prompt format effective?**

They hypothesize that the QA format is effective because question-answer structures occur frequently in the pretraining corpus (the count here means the frequency of occurrence of each format in the corpus).
**AMA’s prompt format**
AMA uses a two-step prompting pipeline:
1. Generating questions based on the input
2. Prompting the LLM to answer the generated questions
These prompts are effective, and to further improve performance they next turn to generating and aggregating over *multiple* prompt-outputs for each input.
### Creating Prompt Collections at Scale
To produce prompts in the effective open-ended question-answering format, their insight is to recursively apply the LLM itself using a *chain* of *functional* prompts, referred to as a $\text{prompt()}$-chain:
(a) $\text{question()} : x \rightarrow q$ generates a question $q$ from an input $x$.
(b) $\text{answer()} : q \rightarrow a$ applies the question generated by (a) to the context of $x$ to produce intermediate answers $a$.
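A minimal sketch of one such $\text{prompt()}$-chain in Python, assuming a generic `llm(text)` completion call; the helper names and few-shot demonstrations are illustrative, not from the paper's code:

```python
def llm(text: str) -> str:
    """Placeholder for a completion call to an open-source LLM."""
    raise NotImplementedError

def question(x: str) -> str:
    """question(): x -> q, reformat the input statement into an open-ended question."""
    demo = ("Statement: Jack's favorite color is blue.\n"
            "Question: What is Jack's favorite color?\n\n")
    return llm(demo + f"Statement: {x}\nQuestion:").strip()

def answer(q: str, x: str) -> str:
    """answer(): q -> a, answer the generated question in the context of x."""
    demo = ("Context: Joe's birthday is tomorrow.\n"
            "Question: When is Joe's birthday?\n"
            "Answer: tomorrow\n\n")
    return llm(demo + f"Context: {x}\nQuestion: {q}\nAnswer:").strip()

def prompt_chain(x: str) -> str:
    """One prompt()-chain; varying the demonstrations in each step yields the prompts in P."""
    return answer(question(x), x)
```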
### Prompt Aggregation
To aggregate the prompt predictions $\mathbf{P}(x)$ into outputs $\hat{y}$ reliably, they apply tools from weak supervision, a powerful approach for learning high-quality models from weaker sources of signal *without labeled data*.
**Baseline observations**
They compare two baselines for constructing $\mathbf{P}$:
- $\mathbf{P}_{\text{T}}$: varying the prompt template with no overlap in the in-context examples
- $\mathbf{P}_{\text{E}}$: varying the in-context examples for a fixed prompt template (both with $\mid \mathbf{P} \mid = 5$).
They observe the following properties on $\mathbf{P}$:
1. *Varied overall accuracies*
2. *Varied class-conditional accuracies*
3. *Highly-correlated outputs*
The three observations present challenges in aggregating predictions via simple approaches like majority vote (MV). MV tends to do better than using one prompt, but it weights all prompts equally and treats them independently. **Such an aggregation method may be sufficient over certain collections of prompts but is not reliable across general $\mathbf{P}$ that may exhibit the three properties observed above.**
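A toy illustration of the third property (hypothetical votes, not from the paper): two near-duplicate prompts that happen to be wrong outvote one accurate prompt under plain majority vote.

```python
from collections import Counter

votes = ["negative", "negative", "positive"]   # suppose the true label is "positive"
# Majority vote counts every prompt equally, so the correlated, wrong votes win.
print(Counter(votes).most_common(1)[0][0])     # -> "negative"
```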
**AMA Aggregation**
Given the varied accuracies and dependencies among $\text{prompt()}$-chains, in AMA we draw on recent work in weak supervision, which is able to account for the accuracy and dependency properties without relying on labeled data.
The aggregator learns a probabilistic graphical model $\text{Pr}_{G, \theta}(y, \mathbf{P}(x))$ and is defined as $\phi_{\text{WS}}(x) = \arg \max_{y \in \mathcal{Y}} \text{Pr}_{G, \theta}(y \mid \mathbf{P}(x))$.
$G = (V, E)$ is a dependency graph with $V = \{y, \mathbf{P}(x) \}$, where an edge $(p_i(x), p_j(x)) \in E$ means that $p_i(x)$ and $p_j(x)$ are *not* conditionally independent given $y$, and $\theta$ are the accuracy parameters for $\mathbf{P}(x)$. Since there is no labeled data $y$, neither $G$ nor $\theta$ can be estimated directly from $\mathcal{D}$.
The method will be:
1. Using a structure learning approach to recover the dependency structure $\hat{G}$ with $\mathbf{P}(x)$ applied to $\mathcal{D}$.
2. Using $\hat{G}, \mathcal{D}$ and $\mathbf{P}(x)$ to learn the accuracy parameters $\hat{\theta}$ of the prompts $\mathbf{P}$.
3. Computing $\text{Pr}_{\hat{G}, \hat{\theta}}(y \mid \mathbf{P}(x))$ and aggregating the predictions.
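As a rough sketch of step 3 only: if the class prior and the per-prompt class-conditional accuracies $\hat{\theta}$ were already estimated, and the prompts were (unrealistically) assumed conditionally independent given $y$, aggregation would reduce to a naive-Bayes-style posterior. The paper's actual label model additionally learns the dependency structure $\hat{G}$ and estimates $\hat{\theta}$ without labels; none of that is implemented here.

```python
import numpy as np

def aggregate(votes, classes, prior, acc):
    """
    Naive-Bayes-style aggregation (independence assumed, unlike AMA's full model).
    votes:   list of m predicted labels, one per prompt
    classes: list of possible labels
    prior:   dict label -> Pr(y = label)
    acc:     list of m dicts, acc[i][y] = estimated Pr(p_i(x) = y | y)
    """
    log_post = {}
    for y in classes:
        lp = np.log(prior[y])
        for i, v in enumerate(votes):
            a = acc[i][y]
            # Prompt i is right with prob a, wrong uniformly over the other classes.
            p = a if v == y else (1.0 - a) / (len(classes) - 1)
            lp += np.log(max(p, 1e-12))
        log_post[y] = lp
    return max(log_post, key=log_post.get)

# Illustrative numbers: one accurate prompt vs. two weaker prompts.
classes = ["positive", "negative"]
prior = {"positive": 0.5, "negative": 0.5}
acc = [{"positive": 0.85, "negative": 0.85},   # accurate prompt
       {"positive": 0.70, "negative": 0.70},
       {"positive": 0.55, "negative": 0.55}]   # near-random prompt
# Prints "positive": the high-accuracy prompt outweighs the other two,
# whereas plain majority vote on these votes would output "negative".
print(aggregate(["positive", "negative", "negative"], classes, prior, acc))
```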
## Information Flow in AMA
**Information flow metric**
They examine the conditional entropy, $H(y \mid \hat{y})$, which measures the amount of uncertainty remaining in the true label $y$ given a prediction $\hat{y}$.
$$
H(y \mid \hat{y})=\underbrace{H(y \mid \mathbf{P}(x))}_{\text {Controlled by } \mathbf{P} \text { prompt quality }}+\underbrace{H(y \mid \hat{y})-H(y \mid \mathbf{P}(x))}_{\text {Controlled by aggregation method } \phi}
$$
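A small worked sketch of the metric (illustrative numbers only), computing $H(y \mid \hat{y})$ from a joint distribution $\Pr(y, \hat{y})$:

```python
import numpy as np

def conditional_entropy(joint):
    """H(y | y_hat) in bits, where joint[i, j] = Pr(y = i, y_hat = j)."""
    p_yhat = joint.sum(axis=0)                 # marginal Pr(y_hat)
    h = 0.0
    for j, pj in enumerate(p_yhat):
        if pj == 0:
            continue
        cond = joint[:, j] / pj                # Pr(y | y_hat = j)
        cond = cond[cond > 0]
        h += pj * -(cond * np.log2(cond)).sum()
    return h

# Illustrative joint distribution over (y, y_hat) for a binary task.
joint = np.array([[0.40, 0.10],
                  [0.05, 0.45]])
print(conditional_entropy(joint))   # ~0.60 bits; lower means y_hat is more informative about y
```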

## Results

## Conclusion
AMA is a simple prompting approach that:
1. Scalably obtains multiple prompts given a task input
2. Combines the intermediate answers to these prompts using weak supervision to give the final prediction

Overall, AMA provides lift across four language model families and across model sizes ranging from 125M to 175B parameters.