Quark: Controllable Text Generation with Reinforced [Un]learning
Outline
- Information
- Quark: Quantized Reward Konditioning
- Experiments
- Model Ablations
- Related Work
- Conclusion
- We consider the task of unlearning misalignments acquired during pretraining by fine-tuning the language model on signals of what not to do.
- Quantized Reward Konditioning (Quark) is an algorithm for optimizing a reward function that quantifies an (un)wanted property.
- By conditioning on a high-reward token at generation time, the model generates text that exhibits less of the unwanted property.
Comparing standard RL with Quark.
- Dynamically (un)learning from sentence-level, scalar feedback is perhaps better suited to the reinforcement learning (RL) paradigm.
- However, RL is highly sensitive to variance in the reward function; such methods rely on additional models (often doubling the number of learnable parameters) and on specialized heuristics to stabilize training.
- Quantized Reward Konditioning (Quark) alternates three steps (see the sketch below):
    - Collect samples with the current language model.
    - Sort them into quantiles based on reward.
    - Maximize the likelihood of the samples from each reward quantile conditioned on its reward token.
- Whereas RL methods stabilize training with an additional parameterized model and specialized optimization heuristics, Quark's training relies only on standard language modeling primitives.
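A minimal Python sketch of this loop. The three callables (`sample_fn`, `reward_fn`, `train_step_fn`) are hypothetical placeholders for the model-specific pieces, and the actual implementation batches and shuffles rather than looping example by example:

```python
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    generation: str
    reward: float

def quantize(pool, num_quantiles):
    """Sort the datapool by reward (ascending) and split it into equally
    sized quantiles; the quantile index doubles as the reward token."""
    ordered = sorted(pool, key=lambda ex: ex.reward)
    size = len(ordered) // num_quantiles
    return [ordered[k * size:(k + 1) * size] for k in range(num_quantiles)]

def quark_iteration(pool, prompts, sample_fn, reward_fn, train_step_fn,
                    num_quantiles=5):
    """One Quark iteration: exploration -> quantization -> learning.

    Placeholders for the caller's model code:
      sample_fn(prompt, reward_token) -> generation
      reward_fn(prompt, generation)   -> float
      train_step_fn(example, reward_token)  # LM loss + KL penalty
    """
    # Exploration: sample conditioned on the highest-reward token,
    # score the samples, and add them to the (cumulative) datapool.
    best_token = num_quantiles - 1
    for prompt in prompts:
        generation = sample_fn(prompt, best_token)
        pool.append(Example(prompt, generation, reward_fn(prompt, generation)))

    # Quantization: re-partition the grown pool by reward.
    quantiles = quantize(pool, num_quantiles)

    # Learning: maximize the likelihood of each sample conditioned on
    # its quantile's reward token.
    for k, quantile in enumerate(quantiles):
        for example in quantile:
            train_step_fn(example, reward_token=k)

    return pool
```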
Quark: Quantized Reward Konditioning
- Initialization: the datapool is seeded with generations sampled from the pretrained language model on the training prompts.
- Exploration: new samples are drawn from the current model conditioned on the highest-reward token, scored by the reward function, and added to the datapool.
- Quantization
    - Quark quantizes each example in the datapool based on how high its reward is compared to the others in the pool.
    - Quark sorts the current iteration's datapool in order of increasing reward and partitions the sorted pool into $K$ equally sized quantiles, $\mathcal{D}^1, \ldots, \mathcal{D}^K$.
- Learning: maximize the likelihood of the samples in each quantile conditioned on that quantile's reward token, with a KL penalty toward the initial model (objective below).
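Concretely, the learning step maximizes the likelihood of each sample conditioned on its quantile's reward token $r_k$, regularized by a per-token KL penalty that keeps the policy $p_\theta$ close to the initial model $p_0$; schematically, the paper's objective is:

$$
\max_\theta \; \mathbb{E}_{k \sim \mathcal{U}(1, K)} \, \mathbb{E}_{(x, y) \sim \mathcal{D}^k} \Big[ \log p_\theta(y \mid x, r_k) \;-\; \beta \sum_t \mathrm{KL}\big( p_0(\cdot \mid y_{<t}, x) \,\big\|\, p_\theta(\cdot \mid y_{<t}, x, r_k) \big) \Big]
$$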
Relationship to prior work
- Inspired by PPO
- The modification is to optimize language model log probabilities directly, without PPO's additional hyperparameters.
- Inspired by the Decision Transformer
- Unlike the Decision Transformer, we include an exploration step.
- We don’t attempt to model discounted reward over multiple timesteps, and instead only consider a one-step bandit environment.
- Inspired by control codes
- We use learned embeddings as a light-weight representation of reward.
- Each reward quantile is encoded via an embedding lookup, following past work on style and content controls, or prompt/prefix encodings.
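A minimal PyTorch sketch of that lookup (the wrapper and its names are illustrative, not from the Quark codebase); the reward-token embeddings are the only parameters added on top of the base LM:

```python
import torch
import torch.nn as nn

class RewardConditionedLM(nn.Module):
    """Prepends a learned reward-quantile embedding to the input of a
    decoder-only LM (illustrative wrapper, not the authors' code)."""

    def __init__(self, lm, hidden_size, num_quantiles):
        super().__init__()
        self.lm = lm  # any LM accepting `inputs_embeds`, e.g., HF GPT-2
        self.reward_embeddings = nn.Embedding(num_quantiles, hidden_size)

    def forward(self, inputs_embeds, reward_token):
        # reward_token: (batch,) quantile indices; look up and prepend.
        r = self.reward_embeddings(reward_token).unsqueeze(1)  # (B, 1, H)
        return self.lm(inputs_embeds=torch.cat([r, inputs_embeds], dim=1))
```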
Experiments
Unlearning Toxicity from Language Models
- Toxicity: which generation is less rude, disrespectful, or unreasonable?
- Topicality: which generation is more natural, relevant, follows logically from the prompt, and maintains consistent tone, word choice, and structure?
- Fluency: which generation is more grammatically correct and coherent? Automatic fluency is measured as the perplexity of the generated output under a larger GPT-2 model.
- Diversity: measured automatically as the count of unique n-grams normalized by the length of the text (see the sketch below).
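As a sketch, that diversity score can be computed as follows (assuming whitespace tokenization for simplicity):

```python
def distinct_n(texts, n):
    """Number of unique n-grams across generations, normalized by the
    total number of tokens (a distinct-n diversity score)."""
    unique_ngrams, total_tokens = set(), 0
    for text in texts:
        tokens = text.split()  # simplification: whitespace tokenization
        total_tokens += len(tokens)
        unique_ngrams.update(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return len(unique_ngrams) / max(total_tokens, 1)

# Example: higher values indicate more lexically diverse generations.
print(distinct_n(["the cat sat", "the cat ran"], 2))  # 3 unique bigrams / 6 tokens = 0.5
```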
Qualitative results
Steering Away from Unwanted Sentiment of Generated Texts
Qualitative results
Unlearning Degenerate Repetition
Qualitative results
Model Ablations
- The effect of the KL term.
- The effect of the number of quantiles.
- The effect of the frequency of exploration.
- The rewards for generations in each partition evolve over time.
Related Work
- Reinforcement Learning in NLP.
- In the domain of open-text generation, REINFORCE and PPO have been used for controllable story generation, and soft Q-learning has been applied to generate prompts for steering language model generations.
- Finally, prior work has used RL techniques to generate language grounded in text-based narrative games.
- Reinforcement learning with transformers.
- Use transformers to produce a sequence of actions with high rewards given observed states.
- Unlike Quark, these agents only access a fixed dataset of pre-specified trajectories and do not learn through interaction with the environment.
- Unlearning undesirable behaviors from language models.
- Generative Cooperative Networks focus on training models such that a discriminator cannot readily distinguish machine- from human-authored text, whereas we focus on capturing external factors via reward functions.
Conclusion
- Quark is a simple but effective method for reward optimization that can unlearn undesirable properties acquired by language models during pretraining.
- A caveat: Quark could also be used to steer language models toward malicious behaviors.
- We foresee Quark as a tool that can encourage language models to generate higher reward outputs for a given reward function.
Future directions include:
- Investigating adaptations of Quark for controlling multiple rewards simultaneously.
- Exploring more diverse types of rewards, e.g., those related to human preferences.
- Training Quark with fewer parameters vs. optimizing all model parameters.
Reference
Code