# RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning
###### tags: `Research`, `prompt`, `PTMs`, `RL`
> [Paper link](https://arxiv.org/pdf/2205.12548.pdf) | [Note link](https://zhuanlan.zhihu.com/p/573754697) | [Code link](https://github.com/mingkaid/rl-prompt) | EMNLP 2022 | Seminar 2023/03/30
## Abstract
This paper proposes RLPROMPT, an efficient discrete prompt optimization approach with reinforcement learning (RL).
To harness the complex and stochastic reward signals from the large LM environment, they incorporate effective reward stabilization that substantially enhances training efficiency.
RLPROMPT is flexibly applicable to different types of LMs, such as masked (e.g., BERT) and left-to-right models (e.g., GPTs), for both classification and generation tasks.
Interestingly, **the resulting optimized prompts are often ungrammatical gibberish text**; and surprisingly, **those gibberish prompts are transferrable between different LMs to retain significant performance**, indicating that LM prompting may not follow human language patterns.
## Introduction
A key question with prompting is how to **find the optimal prompts to improve the LM’s performance on various tasks**, often with only a few training examples.
One of the most popular schemes is to **tune soft prompts** (i.e., continuous embedding vectors) as they are amenable to gradient descent.
But the resulting prompts are, by their nature, **hard for humans to understand** and **the required LM internal gradients are often expensive to compute**, or simply unavailable for LMs deployed with only inference APIs.
This paper presents RLPROMPT, a new discrete prompt optimization approach based on reinforcement learning (RL).

Crucially, rather than directly editing the discrete tokens, which has been difficult and inefficient, **RLPROMPT trains a policy network that generates the desired prompts.**
- **Discrete prompt optimization** thus amounts to **learning a small number of policy parameters**, which they instantiate as an MLP layer inserted into a frozen compact model such as distilGPT-2. This also allows them to **employ off-the-shelf RL algorithms**.
- Another challenge is **learning efficiency**: the large black-box LM presents a highly complex environment that, given the prompt (i.e., actions), goes through a long series of complex transitions (e.g., reading the input and inferring the output) before computing the rewards.
## Discrete Prompt Optimization with RL
They formulate discrete prompt optimization as an RL problem, **using a continuous policy network to explore the prompt space.**
The network is **highly parameter-efficient**, only training a small MLP over a frozen compact LM (e.g., distilGPT-2).
### Discrete Prompt Optimization Problem
A discrete text prompt $\textbf{z}$ can be combined with input $\textbf{x}$ to directly perform various NLP tasks using a pre-trained LM’s generative distribution $P_{\mathrm{LM}}(\textbf{y} \mid \textbf{z}, \textbf{x})$, without fine-tuning the model.
> Classification task:
> LM can be a masked language model (MLM) such as BERT
> $\textbf{y}$ can be a class-label token (a verbalizer such as $\verb|positive|$ or $\verb|negative|$)
> Generation task:
> LM can be a left-to-right model such as GPT-2
> $\textbf{y}$ can be the generated text

==**The method wants to find the optimal discrete prompt $\textbf{z}^*$ from vocabulary $\mathcal{V}$ to maximize some downstream performance measure $R$ of $\textbf{y}_{\mathrm{LM}}(\textbf{z}^*, \textbf{x})$.**==
> Classification task:
> $R(\textbf{y})$ can be the match with the gold label $\textbf{y}^*$
> Generation task:
> $R(\textbf{y})$ composes aspects such as style accuracy, language quality, and content preservation
**Discrete Prompt Optimization**
$$
\tag{1}\max_{\textbf{z} \in \mathcal{V}^T} R(\textbf{y}_{\mathrm{LM}}(\textbf{z}, \textbf{x}))
$$
where $T$ is the length of the prompt, which is fixed.
But $\textbf{z}$ consists of discrete tokens, so gradient-based optimization cannot be applied directly.
And brute-force search has time complexity $\mathcal{O}(|\mathcal{V}|^T)$.
They therefore turn to an RL algorithm to search for the optimal prompt.
### The Reinforcement Learning Formulation
In the RL formulation, an agent selects prompt tokens $[z_1, \dots, z_T]$ one by one to maximize the reward $R(\textbf{y}_{\mathrm{LM}}(\textbf{z}, \textbf{x}))$.
At time step $t$, the agent receives previous prompt tokens $\textbf{z}_{\lt t}$ and generates the next prompt token $z_t$ according to a policy $\pi(z_t \mid \textbf{z}_{\lt t})$.
After the agent finishes the entire prompt $\hat{\textbf{z}}$, it receives the task reward $R(\textbf{y}_{\mathrm{LM}}(\hat{\textbf{z}}, \textbf{x}))$.
Parameterizing the policy with $\boldsymbol{\theta}$, they can rewrite the problem above as
$$
\tag{2}\max _{\boldsymbol{\theta}} R\left(\mathbf{y}_{\mathrm{LM}}(\hat{\mathbf{z}}, \mathbf{x})\right), \hat{\mathbf{z}} \sim \prod_{t=1}^T \pi_{\boldsymbol{\theta}}\left(z_t \mid \mathbf{z}_{<t}\right)
$$
* Compared to typical (soft) prompt tuning approaches, the RL formulation above has the key advantage of **not needing gradient access to the LM**, treating it instead as a black-box function.
* Compared to previous discrete prompt enumeration/paraphrasing, the RL approach explores the prompt space **more efficiently guided by the reward signals.**
:::info
**Practical issue**
During training, they explore the prompt space by sampling from the policy network.
After the policy is trained, they **select tokens greedily** during inference to produce a deterministic prompt.
**The reward objective in Eq.(2) can be optimized with any off-the-shelf RL algorithm.**
They use the latest **soft Q-learning** which has shown advanced learning efficiency and performance **on various text generation problems.**
Specifically, they use only its **on-policy** component (sketched below).
:::
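The loop implied by Eq.(2) can be sketched as follows. This is a minimal, REINFORCE-style on-policy sketch shown only to illustrate the structure (sample a prompt from the policy, query the black-box task LM for a reward, update only the policy parameters); the paper's actual optimizer is the on-policy component of soft Q-learning, and `policy`, `task_lm_reward`, and `optimizer` are hypothetical stand-ins.
```python
# Minimal on-policy sketch of Eq.(2); not the paper's soft Q-learning update.
import torch

def train_step(policy, task_lm_reward, x, optimizer, prompt_len=5):
    """One update: sample a prompt, score it with the task LM, reinforce the policy."""
    log_probs, prompt = [], []
    for t in range(prompt_len):
        dist = policy(prompt)                  # pi_theta(z_t | z_<t), a Categorical
        z_t = dist.sample()                    # explore by sampling during training
        log_probs.append(dist.log_prob(z_t))
        prompt.append(z_t)
    reward = task_lm_reward(prompt, x)         # black-box task LM: no gradient access needed
    loss = -reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()                            # gradients flow only through pi_theta
    optimizer.step()
    return reward
```
At inference time the same policy is decoded greedily (argmax instead of sampling), yielding a single deterministic prompt.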
### Efficient Parameterization of Policy
The policy LM need not be the same as the LM they optimize the prompt for (i.e., task LM). Figure 1 (left) illustrates the policy LM architecture.
Specifically, they use the LM to **extract contextual embeddings of partial prompt $\textbf{z}_{\lt t}$, apply the added task-specific MLP layer to compute the adapted embeddings, and pass the output into the model’s original LM head to obtain the next prompt token probabilities.**
During training, they **compute the MLP gradients by back-propagating through the policy LM.**
After training, they discard the MLP and simply use the learned discrete text prompt for inference.
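A minimal sketch of this parameterization, assuming the Hugging Face `transformers` library and distilGPT-2 as the frozen policy LM; the residual connection around the MLP and other wiring details are assumptions, not the paper's exact implementation.
```python
# Sketch: frozen compact LM + small trainable MLP inserted before the LM head.
import torch.nn as nn
from transformers import AutoModelForCausalLM

class PromptPolicy(nn.Module):
    def __init__(self, model_name="distilgpt2", hidden=2048):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(model_name)
        for p in self.lm.parameters():          # the compact LM stays frozen
            p.requires_grad = False
        d = self.lm.config.n_embd               # 768 for distilGPT-2
        self.mlp = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d))

    def forward(self, input_ids):
        # contextual embeddings of the partial prompt z_<t
        h = self.lm.transformer(input_ids).last_hidden_state
        h = h + self.mlp(h)                     # task-specific adaptation (residual link is an assumption)
        logits = self.lm.lm_head(h)             # reuse the original, frozen LM head
        return logits[:, -1, :]                 # scores for the next prompt token
```
Only the MLP receives gradient updates, which is what keeps the method highly parameter-efficient.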
### Reward Engineering and Stabilization
To address the difficulty of designing reward functions, they propose two simple reward engineering techniques that effectively stabilize RL training.
**Input-Specific $z$-Score Reward**
The same reward value can reflect very different prompt quality for different inputs; naively optimizing all inputs on the same reward scale can therefore lead to training bias and instability.
To mitigate this problem, they propose to use input-specific $z$-score, which **normalizes the rewards by input-specific means and standard deviations.**
Formally, during prompt optimization, they sample a batch of prompts $Z(\mathbf{x})$ for each input $\mathbf{x}$, and compute the reward $R\left(\mathbf{y}_{\mathrm{LM}}(\mathbf{z}, \mathbf{x})\right)$ for each prompt $\mathbf{z} \in Z(\mathbf{x})$.
After that, they compute the reward $z$-scores across prompts $Z(\mathbf{x})$. Using the shorthand $R_{\mathbf{x}}(\mathbf{z})$ for $R\left(\mathbf{y}_{\mathbf{L M}}(\mathbf{z}, \mathbf{x})\right)$, namely the reward prompt $\mathbf{z}$ receives for input $\mathbf{x}$, they write the transformation as below:
$$
\tag{3}z\text{-score}(\mathbf{z}, \mathbf{x})=\frac{R_{\mathbf{x}}(\mathbf{z})-\operatorname{mean}_{\mathbf{z}^{\prime} \in Z(\mathbf{x})} R_{\mathbf{x}}\left(\mathbf{z}^{\prime}\right)}{\operatorname{stdev}_{\mathbf{z}^{\prime} \in Z(\mathbf{x})} R_{\mathbf{x}}\left(\mathbf{z}^{\prime}\right)}
$$
To distinguish the $z$-scores of different inputs in the same batch, they condition their policy network on the inputs, i.e., $\pi_\theta(\mathbf{z} \mid \mathbf{x})$.
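A minimal sketch of Eq.(3): the rewards of the prompts sampled for one input are standardized by that input's own mean and standard deviation (the small `eps` term is an assumption added to avoid division by zero).
```python
# Input-specific z-score normalization of Eq.(3).
import numpy as np

def input_specific_z_score(rewards, eps=1e-8):
    """`rewards` holds R_x(z) for every prompt z in Z(x) sampled for one input x."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rewards that look "large" for an easy input become comparable to those of a
# harder input after normalization:
print(input_specific_z_score([90.0, 95.0, 100.0]))   # ≈ [-1.22, 0.00, 1.22]
print(input_specific_z_score([10.0, 15.0, 20.0]))    # same z-scores
```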
**Piecewise Reward**
If a reward function is misspecified or vulnerable, the policy may maximize it without moving towards the desired goal.
Thus, they propose designing piecewise reward functions with both smooth and disjoint components to better express the task priorities and improve robustness.
## Experiments
### Few-Shot Text Classification
Classification, therefore, amounts to selecting tokens that correspond to a set of predetermined class labels, a.k.a. *verbalizers* (e.g., “$\verb|great|$” for the positive class and “$\verb|terrible|$” for the negative class).
For instance, to classify the sentiment of an input sentence “$\verb|food is delicious|$” using an MLM, they first fill the prompt and the input into a template “$\verb|[Input] [Prompt] [MASK]|$”, and then select the verbalizer token with the highest probability of filling the $\verb|[MASK]|$ position.
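A minimal sketch of this verbalizer-based MLM classification step, assuming the Hugging Face `transformers` library, RoBERTa-large as the task LM, and single-token verbalizers; the prompt string in the usage line is a hypothetical placeholder, not a learned prompt from the paper.
```python
# Fill "[Input] [Prompt] [MASK]" and pick the verbalizer with the highest [MASK] probability.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-large")

def classify(sentence, prompt, verbalizers=(" terrible", " great")):
    text = f"{sentence} {prompt} {tok.mask_token}"            # the template above
    enc = tok(text, return_tensors="pt")
    mask_pos = (enc.input_ids == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = mlm(**enc).logits[0, mask_pos]               # scores for the [MASK] slot
    ids = [tok.convert_tokens_to_ids(tok.tokenize(v)[0]) for v in verbalizers]
    return int(torch.argmax(logits[ids]))                     # index of the winning verbalizer

classify("food is delicious", "Overall it was")               # hypothetical prompt string
```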
**Reward Function**
$$
\tag{4}R(\textbf{x}, c) = \lambda_1^{1- \mathrm{Correct}} \lambda_2^{\mathrm{Correct}} \operatorname{Gap}_{\textbf{z}}(c)
$$
where $c$ is the ground-truth label from a set of classes $C$, and, using the prompted label probabilities, the gap is defined as $\mathrm{Gap}_{\textbf{z}}(c) = P_{\textbf{z}}(c) -\max_{c^\prime \neq c}P_\textbf{z}(c^\prime)$. The gap value is positive when the prediction is correct, and negative otherwise.
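A minimal sketch of Eq.(4); `probs` stands for the verbalizer probabilities $P_{\textbf{z}}(c')$ produced by the prompted task LM for one input, and the $\lambda_1 < \lambda_2$ values here are illustrative rather than the paper's settings.
```python
# Piecewise classification reward of Eq.(4): correct predictions (positive gap)
# are scaled by lambda2, wrong ones (negative gap) by lambda1.
import numpy as np

def classification_reward(probs, c, lambda1=20.0, lambda2=40.0):
    probs = np.asarray(probs, dtype=float)
    gap = probs[c] - max(p for i, p in enumerate(probs) if i != c)
    correct = 1.0 if gap > 0 else 0.0           # positive gap <=> correct prediction
    return lambda1 ** (1 - correct) * lambda2 ** correct * gap

print(classification_reward([0.7, 0.3], c=0))   # correct: 40 * 0.4 = 16.0
print(classification_reward([0.4, 0.6], c=0))   # wrong:   20 * (-0.2) ≈ -4.0
```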
**Baselines**
They compare with **the latest Black-Box (BB) Tuning, which mixes discrete and soft prompts and tunes the soft part.**
**Experiment Setup**
They use RoBERTa-large as the backbone model, with prompt lengths $T \in \{ 2, 5\}$, and insert the prompt tokens at the **same positions as the manual prompts.**
**Results**

**Training Efficiency**
<center>
<img src = "https://i.imgur.com/xR3JOa6.png">
</center>
### Unsupervised Text Style Transfer
In a sentiment transfer task, given a negative sentence “$\verb|The food is disgusting|$”, the model should generate a positive sentence such as “$\verb|The food is delicious|$”, without training on such paired data.
Even without supervised data, the method can learn prompts from weak reward signals, which is not possible for most previous prompt optimization methods.
**Reward Function**
Given input sentence $\textbf{x}$, the goal of TST is to generate output $\textbf{y}$ that preserves the information in $\textbf{x}$ while showing style attribute $s$.
$$
\tag{5}R(\mathbf{x}, \mathbf{y}, s)=\operatorname{Content}(\mathbf{x}, \mathbf{y})+\operatorname{Style}(\mathbf{y}, s)
$$
Because the reward varies in scale across inputs, the paper normalizes the rewards using the input-specific $z$-score.
**Baselines**
They compare with two strong training methods, Style Transformer and DiRR.
In particular, DiRR fine-tunes GPT-2 with RL signals, which can be seen as a full-model tuning analogue of their method.
They also compare with prompt baselines:
- Null prompt
- Random prompt (5 tokens from the vocabulary)
- Manual prompt (3 human-written templates)
**Experiment Setup**
They use GPT-2 models of varying sizes as the task LM, ranging from the smallest distilGPT-2 with 82M parameters to the largest GPT-2-xl with 1.5B parameters.
They fix the prompt length $T=5$.
To generate output $\hat{\textbf{y}}$, for all comparison methods, they sample 32 candidates from the respective models, and select the one with the highest reward.
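A minimal sketch of this sample-and-rank selection; `generate_candidates` and `reward_fn` are hypothetical stand-ins for the prompted GPT-2 sampler and the reward of Eq.(5).
```python
# Sample n candidate outputs for an input and keep the highest-reward one.
def best_of_n(x, prompt, generate_candidates, reward_fn, n=32):
    candidates = generate_candidates(prompt, x, num_samples=n)
    return max(candidates, key=lambda y: reward_fn(x, y))
```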
**Evaluation**
They perform both automatic and human evaluation on the content preservation, style accuracy, and fluency of model outputs.
- Automatic evaluation
    - $Content$ measured by state-of-the-art input-output alignment using a pre-trained LM
    - $Style$ measured by fine-tuned style classifiers
    - $Fluency$ measured by a grammaticality classifier
To aggregate the quality dimensions, they average the joint sentence-level scores of examples $\textbf{x}$ from the test set $\mathcal{X}$.
$$
\tag{6}J(\text{Content}, \text{Style}, \text{Fluency})= \operatorname{mean}_{\mathbf{x} \in \mathcal{X}}\left(\operatorname{Content}(\mathbf{x}) \cdot \operatorname{Style}(\mathbf{x}) \cdot \operatorname{Fluency}(\mathbf{x})\right)
$$
And they also report the geometric mean (GM) of the three overall aspect scores.
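A minimal sketch of the aggregation in Eq.(6) together with the GM score, assuming sentence-level (Content, Style, Fluency) scores are already available for each test example.
```python
# J: mean of per-sentence Content*Style*Fluency products (Eq. 6).
# GM: geometric mean of the three corpus-level aspect scores.
import numpy as np

def joint_and_gm(scores):
    scores = np.asarray(scores, dtype=float)       # shape: (num_examples, 3)
    j = scores.prod(axis=1).mean()
    gm = np.prod(scores.mean(axis=0)) ** (1 / 3)
    return j, gm

print(joint_and_gm([[0.9, 1.0, 0.8], [0.7, 0.0, 0.9]]))
```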
**Results**
<center>
<img src = "https://i.imgur.com/I8LzBoP.png">
<br> Automatic evaluation results
</center><br>
<center>
<img src = "https://i.imgur.com/yeqnCzB.png">
</center><br>
<center>
<img src = "https://i.imgur.com/ZXpJBNE.png">
<br> Human evaluation results
</center>
**Ablation Study**
As the visualized results in Figures 3 and 6 show, $z$-score normalization achieves both superior performance and more stable improvement across random seeds and training tasks.
Because training easily collapsed without the $z$-score under the original hyperparameters, they **tuned the reward shaping scheme to transform the scale $[50, 100]$ into $[-50, 50]$, which substantially improved training stability and results.**
### Analysis
**Fluent vs. Gibberish Prompts**
The fluency-constrained prompts have markedly lower perplexity, which indicates higher language coherence.
**Transferring Prompts across LMs**
One unique advantage of discrete prompts over soft prompts is that they are transferable across models, thanks to the shared text space rather than a model-specific latent space.
:::success
Interestingly, experiments show that the optimized prompts, though largely gibberish text, can indeed retain significant performance after transferring to different LMs.
Furthermore, prompts can transfer from smaller to larger models for similar or even better performance.
:::
Overall, all prompts can transfer between models, but **the success depends on both the source and target LMs.**
Prompts learned from larger models see sharp performance declines when applied to smaller models. In contrast, prompts learned from smaller models reach similar or better performance on larger models.
**Robustness to Classification Verbalizers**
In few-shot classification, their method can discover well-performing prompts given a wide variety of verbalizers.
<center>
<img src = "https://i.imgur.com/hW4LLF0.png">
</center>
## Related Work
- Fine-tuning pre-trained LMs on downstream datasets
- Manual prompts can steer large LMs to perform NLP tasks without any training
- Developing instructional prompts which provide task descriptions instead of fill-in-the-blank questions
- Replacing discrete prompts with continuous embeddings by tuning soft prompts using gradient descent
- Locating better discrete prompts by augmenting human-written prompts with heuristics
- Paraphrasing
- Editing
- Selecting by some downstream metric
## Conclusion
The paper proposes RLPROMPT, an **efficient** and **flexible** approach for discrete prompt optimization using RL, which improves over a wide range of fine-tuning and prompting methods in experiments on few-shot classification and unsupervised text style transfer.
The observation that optimized gibberish prompts transfer across LMs opens up many promising possibilities for prompting, such as learning prompts cheaply from smaller models and performing inference with larger models.
## Limitations
They have not experimented with more recent huge models like GPT-3.
As is the case for typical RL methods, designing reward functions may need domain expertise.
However, they may solve this problem using techniques such as inverse RL, which learns the reward function from data.
## Appendix
### A.1 Policy Network
For all tasks, **they uniformly use distilGPT-2 with 82M parameters as a compact policy LM, and implement a generously parameterized MLP with 1 hidden layer and a hidden size of 2048.**
Given distilGPT-2’s hidden size of 768, they only add 3.1M parameters, or 3.8% of the LM parameters.
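A quick check of these numbers, assuming a standard two-layer (768 → 2048 → 768) MLP with biases:
```python
# Parameter count of the 1-hidden-layer MLP relative to distilGPT-2's 82M parameters.
d, h = 768, 2048
mlp_params = (d * h + h) + (h * d + d)   # weights + biases of both linear layers
print(mlp_params)                        # 3,148,544 ≈ 3.1M
print(mlp_params / 82_000_000)           # ≈ 0.038, i.e. about 3.8%
```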