
Quark: Controllable Text Generation with Reinforced [Un]learning

tags: RL Group meeting

Quark: Controllable Text Generation with Reinforced [Un]learning (GitHub)

Outline

  • Information
  • Quark: Quantized Reward Konditioning
  • Experiments
  • Model Ablations
  • Related Work
  • Conclusion

Information

  • The paper considers the task of unlearning misalignments (e.g., toxicity, unwanted sentiment, degenerate repetition) by fine-tuning the language model on signals of what not to do.
  • Quantized Reward Konditioning (Quark) is an algorithm for optimizing a reward function that quantifies an (un)wanted property.
  • By conditioning on a high-reward token at generation time, the model generates text that exhibits less of the unwanted property.

Comparing standard RL with Quark.

  • Dynamically (un)learning from sentence-level, scalar feedback is perhaps better suited to the reinforcement learning (RL) paradigm.
    • However, RL is highly sensitive to variance in the reward function; these methods rely on additional models (often doubling the number of learnable parameters) and specialized heuristics to stabilize training.
  • Quantized Reward Konditioning (Quark):
    • Collect samples with the current language model.
    • Sort them into quantiles based on reward.
    • Maximize the likelihood of the samples from each reward quantile conditioned on its reward token.
  • Whereas RL methods stabilize training with an additional parameterized model and specialized optimization heuristics, Quark’s training relies only on standard language modeling primitives.

Quark: Quantized Reward Konditioning


  • Initialization: build the initial datapool by sampling from the pretrained model $p_0$,
    $\mathcal{D}_0 = \{(x, y, r(x, y)) \mid y \sim p_0(\cdot \mid x),\ \text{for all } x \in X\}$
  • Exploration: add new samples conditioned on the highest-reward token $r_K$,
    $\mathcal{D} \leftarrow \mathcal{D} \cup \{(x, y, r(x, y)) \mid y \sim p_\theta(\cdot \mid x, r_K),\ \text{for all } x \in X\}$
  • Quantization
    • Quark quantizes each example in the datapool based on how high its reward is compared to others in the data pool.
    • Quark sorts the current iteration’s datapool in order of increasing reward and partitions the sorted pool into equally sized quantiles $\mathcal{D}^1, \ldots, \mathcal{D}^K$.
  • Learning (a runnable toy sketch of the full loop follows this list)
    $\max_\theta \; \mathbb{E}_{k \sim \mathcal{U}(1, K)} \, \mathbb{E}_{(x, y) \sim \mathcal{D}^k} \Big[ \log p_\theta(y \mid x, r_k) - \beta \sum_{t=1}^{T} \mathrm{KL}\big(p_0(\cdot \mid y_{<t}, x)\,\|\,p_\theta(\cdot \mid y_{<t}, x, r_k)\big) \Big]$
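
A minimal, runnable toy sketch of this training loop is below. The names `sample_continuation`, `reward`, `Example`, and `quantize` are illustrative stand-ins rather than the paper's code, and the learning step only marks where the gradient update on the objective above would go.

```python
import random
from dataclasses import dataclass

# Toy stand-ins (assumptions, not the paper's implementation): a "model" that
# samples token sequences at random and a reward that penalizes the unwanted
# word "bad". In Quark these are a conditional LM p_theta(y | x, r_k) and a
# task-specific reward r(x, y), e.g., a toxicity classifier.
VOCAB = ["good", "fine", "bad", "ok"]

def sample_continuation(prompt, reward_token=None, length=8):
    # Placeholder for y ~ p_theta(. | x, r_K); ignores the conditioning here.
    return [random.choice(VOCAB) for _ in range(length)]

def reward(prompt, continuation):
    # Placeholder for r(x, y): higher is better (fewer "bad" tokens).
    return 1.0 - continuation.count("bad") / len(continuation)

@dataclass
class Example:
    prompt: str
    continuation: list
    r: float

def quantize(pool, K):
    """Sort the datapool by reward and split it into K equally sized quantiles.

    Quantile k (1..K, low to high reward) supplies the reward token r_k that
    its examples are conditioned on during the learning step.
    """
    ranked = sorted(pool, key=lambda e: e.r)
    n = len(ranked)
    return [ranked[k * n // K:(k + 1) * n // K] for k in range(K)]

prompts = [f"prompt-{i}" for i in range(16)]
K, num_iterations = 5, 3

# Initialization: D_0 is built by sampling from the initial model p_0.
pool = []
for x in prompts:
    y = sample_continuation(x)
    pool.append(Example(x, y, reward(x, y)))

for it in range(num_iterations):
    # Exploration: add samples conditioned on the highest-reward token r_K.
    for x in prompts:
        y = sample_continuation(x, reward_token=K)
        pool.append(Example(x, y, reward(x, y)))

    # Quantization: partition the pool into D^1, ..., D^K by reward.
    quantiles = quantize(pool, K)

    # Learning: maximize log p_theta(y | x, r_k) with a per-token KL penalty
    # toward p_0; the gradient step itself is omitted in this sketch.
    for k, D_k in enumerate(quantiles, start=1):
        for ex in D_k:
            pass  # e.g., loss = -log_prob(ex, r_k) + beta * kl_to_p0(ex, r_k)

print("pool size:", len(pool), "| best reward:", max(e.r for e in pool))
```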


Relationship to prior work

  • Inspired by PPO
    • KL-penalized reward: $\tilde{r}(x) = r(x) - \beta \log \frac{p_\theta(x)}{p_0(x)}$
    • Quark instead folds the KL penalty into the learning objective, $\max_\theta \; \mathbb{E}_{k \sim \mathcal{U}(1, K)} \, \mathbb{E}_{(x, y) \sim \mathcal{D}^k} \big[ \log p_\theta(y \mid x, r_k) - \beta \sum_{t=1}^{T} \mathrm{KL}\big(p_0(\cdot \mid y_{<t}, x)\,\|\,p_\theta(\cdot \mid y_{<t}, x, r_k)\big) \big]$
    • This modification optimizes language model log-probabilities directly, without the additional hyperparameters of PPO.
  • Inspired by the Decision Transformer
    • We have an exploration step.
    • We don’t attempt to model discounted reward over multiple timesteps, and instead only consider a one-step bandit environment.
  • Inspired by control codes
    • We use learned embeddings as a light-weight representation of reward.
    • Each reward quantile is encoded via an embedding lookup, following past work on style and content controls, or prompt/prefix encodings (see the sketch after this list).
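
As a hedged sketch of the embedding-lookup idea (using the Hugging Face transformers API rather than the paper's codebase; the reward-token strings are invented for illustration), one can add K reward tokens to the vocabulary, resize the embedding table, and prepend the highest-reward token at generation time. Before fine-tuning, these new embeddings are randomly initialized, so the snippet only demonstrates the mechanics of conditioning.

```python
# Requires: pip install transformers torch (Hugging Face, not the paper's repo).
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

K = 5  # number of reward quantiles
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# One new token per reward quantile (token strings are made up for this
# sketch); each gets its own learned embedding via the resized embedding table.
reward_tokens = [f"<|reward_{k}|>" for k in range(1, K + 1)]
tokenizer.add_tokens(reward_tokens)
model.resize_token_embeddings(len(tokenizer))

# Conditioning at generation time: prepend the highest-reward token r_K.
prompt = reward_tokens[-1] + " The movie was"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=20, do_sample=True,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0]))
```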

Experiments

Unlearning Toxicity from Language Models


  • Toxicity: Which one is less rude, disrespectful, or unreasonable?
  • Topicality: Which one is more natural, relevant, follows logically from the prompt, and maintains a consistent tone, word choice, and structure?
  • Fluency: Which one is more grammatically correct and coherent? Automatic fluency is measured as the perplexity of the generated output according to a larger GPT-2 model.
  • Diversity is measured as the count of unique n-grams normalized by the length of the text (a small sketch of this metric follows this list).
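
A minimal sketch of the diversity metric described above; the function name and toy generations are illustrative, not the paper's evaluation code.

```python
# Illustrative helper: the number of unique n-grams divided by the total
# number of generated tokens, computed over a set of generations.
def distinct_n(texts, n):
    total_tokens, unique_ngrams = 0, set()
    for text in texts:
        tokens = text.split()
        total_tokens += len(tokens)
        unique_ngrams.update(tuple(tokens[i:i + n])
                             for i in range(len(tokens) - n + 1))
    return len(unique_ngrams) / max(total_tokens, 1)

generations = ["the cat sat on the mat", "the cat ran away quickly"]
print({f"dist-{n}": round(distinct_n(generations, n), 3) for n in (1, 2, 3)})
```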

Qualitative results


Steering Away from Unwanted Sentiment of Generated Texts


Qualitative results


Unlearning Degenerate Repetition


Qualitative results


Model Ablations


  • The effect of the KL term.
  • The effect of the number of quantiles.


  • The effect of the frequency of exploration.
  • How the rewards for generations in each partition evolve over time.


Related Work

  • Reinforcement Learning in NLP.
    • In the domain of open-ended text generation, REINFORCE and PPO have been used for controllable story generation, and soft Q-learning has been applied to generate prompts for steering language model generations.
    • Finally, prior work has used RL techniques to generate language grounded in text-based narrative games.
  • Reinforcement learning with transformers.
    • These methods use transformers to produce a sequence of actions with high reward given observed states.
    • Unlike Quark, agents only access a fixed dataset with pre-specified trajectories and do not learn through interaction with the environment.
  • Unlearning undesirable behaviors from language models.
    • Generative Cooperative Networks focus on training models such that a discriminator cannot readily identify machine- vs. human-authored text, whereas we focus on capturing external factors via reward functions.

Conclusion

  • Quark is a simple but effective method for reward optimization that unlearns undesirable properties of language models acquired during pretraining.
  • As a caveat, Quark could also be used to steer language models towards malicious behaviors.
  • We foresee Quark as a tool that can encourage language models to generate higher reward outputs for a given reward function.

Future directions include:

  1. Investigating adaptations of Quark for controlling multiple rewards simultaneously.
  2. Exploring more diverse types of rewards, e.g., those related to human preferences.
  3. Training Quark with fewer learnable parameters instead of optimizing all model parameters.

Reference
Code