
Quark: Controllable Text Generation with Reinforced [Un]learning

tags: RL Group meeting

Quark: Controllable Text Generation with Reinforced [Un]learning (GitHub)

Outline

  • Information
  • Quark: Quantized Reward Konditioning
  • Experiments
  • Model Ablations
  • Related Work
  • Conclusion

Information

  • The paper considers the task of unlearning misalignments (e.g., toxicity, unwanted sentiment, degenerate repetition) by fine-tuning the language model on signals of what not to do.
  • Quantized Reward Konditioning (Quark) is an algorithm for optimizing a reward function that quantifies an (un)wanted property.
  • By conditioning on a high-reward token at generation time, the model generates text that exhibits less of the unwanted property.

Comparing standard RL with Quark.

  • Dynamically (un)learning from sentence-level, scalar feedback is perhaps better suited to the reinforcement learning (RL) paradigm.
    • However, RL is highly sensitive to variance in the reward function; these methods rely on additional models (often doubling the number of learnable parameters) and specialized heuristics to stabilize training.
  • Quantized Reward Konditioning (Quark):
    • Collect samples with the current language model.
    • Sort them into quantiles based on reward.
    • Maximize the likelihood of the samples from each reward quantile conditioned on its reward token.
  • Whereas RL methods stabilize training with an additional parameterized model and specialized optimization heuristics, Quark’s training relies only on standard language modeling primitives.

Quark: Quantized Reward Konditioning


  • Initialization: build the initial datapool by sampling from the pretrained model $p_0$,
    $\mathcal{D}_0 = \{(x, y, r(x, y)) \mid y \sim p_0(\cdot \mid x),\ \text{for all } x \in X\}$
  • Exploration: add new samples conditioned on the highest-reward token $r_K$,
    $\mathcal{D} \leftarrow \mathcal{D} \cup \{(x, y, r(x, y)) \mid y \sim p_\theta(\cdot \mid x, r_K),\ \text{for all } x \in X\}$
  • Quantization
    • Quark quantizes each example in the datapool based on how high its reward is compared to others in the data pool.
    • Quark sorts the current iteration’s datapool in order of increasing reward and partitions the sorted pool into equally sized quantiles $\mathcal{D}^1, \ldots, \mathcal{D}^K$.
  • Learning (a runnable toy sketch of the full loop follows this list)
    $\max_\theta \; \mathbb{E}_{k \sim \mathcal{U}(1, K)} \, \mathbb{E}_{(x, y) \sim \mathcal{D}^k} \Big[ \log p_\theta(y \mid x, r_k) - \beta \sum_{t=1}^{T} \mathrm{KL}\big(p_0(\cdot \mid y_{<t}, x)\,\|\,p_\theta(\cdot \mid y_{<t}, x, r_k)\big) \Big]$
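
A minimal, runnable toy sketch of this training loop is below. The names `sample_continuation`, `reward`, `Example`, and `quantize` are illustrative stand-ins rather than the paper's code, and the learning step only marks where the gradient update on the objective above would go.

```python
import random
from dataclasses import dataclass

# Toy stand-ins (assumptions, not the paper's implementation): a "model" that
# samples token sequences at random and a reward that penalizes the unwanted
# word "bad". In Quark these are a conditional LM p_theta(y | x, r_k) and a
# task-specific reward r(x, y), e.g., a toxicity classifier.
VOCAB = ["good", "fine", "bad", "ok"]

def sample_continuation(prompt, reward_token=None, length=8):
    # Placeholder for y ~ p_theta(. | x, r_K); ignores the conditioning here.
    return [random.choice(VOCAB) for _ in range(length)]

def reward(prompt, continuation):
    # Placeholder for r(x, y): higher is better (fewer "bad" tokens).
    return 1.0 - continuation.count("bad") / len(continuation)

@dataclass
class Example:
    prompt: str
    continuation: list
    r: float

def quantize(pool, K):
    """Sort the datapool by reward and split it into K equally sized quantiles.

    Quantile k (1..K, low to high reward) supplies the reward token r_k that
    its examples are conditioned on during the learning step.
    """
    ranked = sorted(pool, key=lambda e: e.r)
    n = len(ranked)
    return [ranked[k * n // K:(k + 1) * n // K] for k in range(K)]

prompts = [f"prompt-{i}" for i in range(16)]
K, num_iterations = 5, 3

# Initialization: D_0 is built by sampling from the initial model p_0.
pool = []
for x in prompts:
    y = sample_continuation(x)
    pool.append(Example(x, y, reward(x, y)))

for it in range(num_iterations):
    # Exploration: add samples conditioned on the highest-reward token r_K.
    for x in prompts:
        y = sample_continuation(x, reward_token=K)
        pool.append(Example(x, y, reward(x, y)))

    # Quantization: partition the pool into D^1, ..., D^K by reward.
    quantiles = quantize(pool, K)

    # Learning: maximize log p_theta(y | x, r_k) with a per-token KL penalty
    # toward p_0; the gradient step itself is omitted in this sketch.
    for k, D_k in enumerate(quantiles, start=1):
        for ex in D_k:
            pass  # e.g., loss = -log_prob(ex, r_k) + beta * kl_to_p0(ex, r_k)

print("pool size:", len(pool), "| best reward:", max(e.r for e in pool))
```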


Relationship to prior work

  • Inspired by PPO
    • KL-penalized reward: $\tilde{r}(x) = r(x) - \beta \log \frac{p_\theta(x)}{p_0(x)}$
    • Quark instead folds the KL penalty into the learning objective, $\max_\theta \; \mathbb{E}_{k \sim \mathcal{U}(1, K)} \, \mathbb{E}_{(x, y) \sim \mathcal{D}^k} \big[ \log p_\theta(y \mid x, r_k) - \beta \sum_{t=1}^{T} \mathrm{KL}\big(p_0(\cdot \mid y_{<t}, x)\,\|\,p_\theta(\cdot \mid y_{<t}, x, r_k)\big) \big]$
    • This modification optimizes language model log-probabilities directly, without the additional hyperparameters of PPO.
  • Inspired by the Decision Transformer
    • We have an exploration step.
    • We don’t attempt to model discounted reward over multiple timesteps, and instead only consider a one-step bandit environment.
  • Inspired by control codes
    • We use learned embeddings as a light-weight representation of reward.
    • Each reward quantile is encoded via an embedding lookup, following past work on style and content controls, or prompt/prefix encodings (see the sketch after this list).
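
As a hedged sketch of the embedding-lookup idea (using the Hugging Face transformers API rather than the paper's codebase; the reward-token strings are invented for illustration), one can add K reward tokens to the vocabulary, resize the embedding table, and prepend the highest-reward token at generation time. Before fine-tuning, these new embeddings are randomly initialized, so the snippet only demonstrates the mechanics of conditioning.

```python
# Requires: pip install transformers torch (Hugging Face, not the paper's repo).
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

K = 5  # number of reward quantiles
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# One new token per reward quantile (token strings are made up for this
# sketch); each gets its own learned embedding via the resized embedding table.
reward_tokens = [f"<|reward_{k}|>" for k in range(1, K + 1)]
tokenizer.add_tokens(reward_tokens)
model.resize_token_embeddings(len(tokenizer))

# Conditioning at generation time: prepend the highest-reward token r_K.
prompt = reward_tokens[-1] + " The movie was"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=20, do_sample=True,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0]))
```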

Experiments

Unlearning Toxicity from Language Models


  • Toxicity: Which one is less rude, disrespectful, or unreasonable?
  • Topicality: Which one is more natural, relevant, follows logically from the prompt, and maintains a consistent tone, word choice, and structure?
  • Fluency: Which one is more grammatically correct and coherent? Automatic fluency is measured as the perplexity of the generated output according to a larger GPT-2 model.
  • Diversity is measured as the count of unique n-grams normalized by the length of the text (a small sketch of this metric follows this list).
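
A minimal sketch of the diversity metric described above; the function name and toy generations are illustrative, not the paper's evaluation code.

```python
# Illustrative helper: the number of unique n-grams divided by the total
# number of generated tokens, computed over a set of generations.
def distinct_n(texts, n):
    total_tokens, unique_ngrams = 0, set()
    for text in texts:
        tokens = text.split()
        total_tokens += len(tokens)
        unique_ngrams.update(tuple(tokens[i:i + n])
                             for i in range(len(tokens) - n + 1))
    return len(unique_ngrams) / max(total_tokens, 1)

generations = ["the cat sat on the mat", "the cat ran away quickly"]
print({f"dist-{n}": round(distinct_n(generations, n), 3) for n in (1, 2, 3)})
```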

Qualitative results


Steering Away from Unwanted Sentiment of Generated Texts


Qualitative results


Unlearning Degenerate Repetition


Qualitative results


Model Ablations


  • The effect of the KL term.
  • The effect of the number of quantiles.


  • The effect of the frequency of exploration.
  • How the rewards for generations in each partition evolve over time.


Related Work

  • Reinforcement Learning in NLP.
    • In the domain of open-ended text generation, REINFORCE and PPO have been used for controllable story generation, and soft Q-learning has been applied to generate prompts for steering language model generations.
    • Finally, prior work has used RL techniques to generate language grounded in text-based narrative games.
  • Reinforcement learning with transformers.
    • These methods use transformers to produce a sequence of actions with high reward given observed states.
    • Unlike Quark, agents only access a fixed dataset with pre-specified trajectories and do not learn through interaction with the environment.
  • Unlearning undesirable behaviors from language models.
    • Generative Cooperative Networks focus on training models such that a discriminator cannot readily identify machine- vs. human-authored text, whereas we focus on capturing external factors via reward functions.

Conclusion

  • Quark is a simple but effective method for reward optimization that unlearns undesirable properties of language models acquired during pretraining.
  • As a caveat, Quark could also be used to steer language models towards malicious behaviors.
  • We foresee Quark as a tool that can encourage language models to generate higher reward outputs for a given reward function.

Future directions include:

  1. Investigating adaptations of Quark for controlling multiple rewards simultaneously.
  2. Exploring more diverse types of rewards, e.g., those related to human preferences.
  3. Training Quark with fewer learnable parameters instead of optimizing all model parameters.

Reference
Code