# [Tractable Control for Autoregressive Language Generation](https://openreview.net/attachment?id=ET6qkbzeOx&name=pdf)
*2023/08/22*
###### tags: `RL Group meeting`
# Outline
* Introduction
* Related work
* Guiding Autoregressive Generation with Tractable Probabilistic Models
* Efficient Probabilistic Reasoning with Hidden Markov Models
* Experiments
* Conclusion
# Introduction
* Generating text that satisfies complex constraints remains a major challenge for autoregressive large language models:
* Sampling from the conditional distribution $\operatorname{Pr}(\operatorname{text} | α)$ is intractable for even the simplest lexical constraints $α$.
* We propose to use tractable probabilistic models (TPMs) to impose lexical constraints in autoregressive text generation models, which we refer to as **GeLaTo** (Generating Language with Tractable Constraints).
* Our goal is to **generate text effectively** following the conditional distribution $\operatorname{Pr_{LM}}(x_{1:n} | α)$ for arbitrary lexical constraints α.
* TPMs can efficiently compute the joint probability distribution over the input sequence and the constraints, which allows for more precise control over the generation process.
* Pre-trained LMs only model the next token distribution given some prefix, and conditioning on constraints can be intractable even for simple constraints.
* We use distilled **hidden Markov models** (HMMs):
    1. HMMs allow us to efficiently compute $\operatorname{Pr}(\operatorname{text} | α)$, which we use to guide autoregressive generation from GPT2.
    2. We propose a dynamic programming algorithm that efficiently computes the conditional probabilities $\operatorname{Pr_{HMM}}(· | α)$.
* Our study demonstrates the potential of TPMs in controlling large language models and motivates the development of more expressive TPMs.

* GeLaTo proceeds in two steps:
1. We train a TPM $\operatorname{Pr_{TPM}}$ via maximum likelihood estimation (MLE) on samples drawn from $\operatorname{Pr_{LM}}$, which is equivalent to minimizing the KL-divergence between $\operatorname{Pr_{TPM}}$ and $\operatorname{Pr_{LM}}$;
2. At generation time, we compute $\operatorname{Pr_{TPM}}(x_{t+1} | x_{1:t}, α)$ efficiently and combine it with $\operatorname{Pr_{LM}}(x_{t+1} | x_{1:t})$ to approximate $\operatorname{Pr_{LM}}(x_{t+1} | x_{1:t}, α)$ for reliable control.
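A minimal sketch of step 1 (distillation), assuming HuggingFace `transformers` for the LM and `hmmlearn`'s `CategoricalHMM` as a stand-in TPM; this is not the paper's training pipeline, and the model name, sample counts, and HMM size are illustrative only.

```python
# Sketch: distill Pr_LM into an HMM by MLE on samples drawn from the LM.
# Assumption: recent `transformers` and `hmmlearn` (>= 0.2.8, for CategoricalHMM).
from transformers import AutoModelForCausalLM, AutoTokenizer
from hmmlearn.hmm import CategoricalHMM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

# Step 1: draw unconditional samples x_{1:n} ~ Pr_LM.
prompt = tokenizer(tokenizer.bos_token, return_tensors="pt")
samples = lm.generate(
    **prompt, do_sample=True, max_length=32,
    num_return_sequences=64, pad_token_id=tokenizer.eos_token_id,
)

# Step 2: fit the TPM (here an HMM) by maximum likelihood (Baum-Welch / EM),
# which minimizes KL(Pr_LM || Pr_TPM) up to the constant entropy term of Pr_LM.
X = samples.numpy().reshape(-1, 1)                  # concatenated token ids
lengths = [samples.shape[1]] * samples.shape[0]     # one length per sampled sequence
hmm = CategoricalHMM(n_components=128, n_iter=10)   # 128 hidden states, illustrative
hmm.fit(X, lengths)
```

In the paper the HMM is trained at much larger scale; the point here is only that the TPM is fit to samples from the LM, independently of any constraint α.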
# Related work
## Tractable probabilistic models
* A class of queries $\mathbf{Q}$ is tractable on a family of probabilistic models $\mathcal{M}$ iff any query $q \in \mathbf{Q}$ on a model $m \in \mathcal{M}$ can be computed in time $\mathcal{O}($ poly $(|m|))$.
* We also say that $\mathcal{M}$ is a tractable model for $\mathbf{Q}$.
* Tractable probabilistic models support efficient probabilistic inference.
* Probabilistic circuits (PCs) provide a unified framework for a large family of tractable probabilistic models:
* hidden Markov models
* bounded tree-width graphical models
* sum-product networks (SPNs)
## Controllable Autoregressive Language Generation
* One line of research on constrained text generation focuses on modifying the decoding algorithm to inject constraints into the beam search process
* Search-based
* constrained beam search
* NeuroLogic Decoding
* A*esque NeuroLogic Decoding
* Token-level
* NADO
* FUDGE
* Insertion-based
# Guiding Autoregressive Generation with Tractable Probabilistic Models
* Our goal is to generate from the following conditional distribution:
\begin{equation} \operatorname{Pr}_{\mathrm{LM}}\left(x_{1: n} \mid \alpha\right)=\prod_t \operatorname{Pr}_{\mathrm{LM}}\left(x_{t+1} \mid x_{1: t}, \alpha\right) \end{equation}
* $\operatorname{Pr}_{\operatorname{LM}}(x_{t+1} | x_{1:t}, α)$ is intractable
* We can assume that $\operatorname{Pr_{TPM}}(x_{t+1} | x_{1:t}, α)$ can be efficiently computed.
* We train the TPM model via MLE:
$$
\mathbb{E}_{x_{1: n} \sim \operatorname{Pr}_{\mathrm{LM}}} \log \operatorname{Pr}_{\mathrm{TPM}}\left(x_{1: n}\right)
$$
* This effectively minimizes their KL-divergence, since the first term below (the negative entropy of $\operatorname{Pr}_{\mathrm{LM}}$) does not depend on $\operatorname{Pr_{TPM}}$:
$$
\begin{aligned}
& D_{\mathrm{KL}}\left(\operatorname{Pr}_{\mathrm{LM}} \| \operatorname{Pr}_{\mathrm{TPM}}\right) \\
& =\mathbb{E}_{x_{1: n} \sim \operatorname{Pr}_{\mathrm{LM}}} \log \operatorname{Pr}_{\mathrm{LM}}\left(x_{1: n}\right)-\mathbb{E}_{x_{1: n} \sim \operatorname{Pr}_{\mathrm{LM}}} \log \operatorname{Pr}_{\mathrm{TPM}}\left(x_{1: n}\right)
\end{aligned}
$$
* We assume that there exists some “quality” constraint $β$ such that $\operatorname{Pr_{TPM}}(\cdot \mid β)$ is even closer to $\operatorname{Pr_{LM}}$:
$$
\operatorname{Pr}_{\mathrm{TPM}}\left(x_{1: n} \mid \alpha, \beta\right)=\prod_t \operatorname{Pr}_{\mathrm{TPM}}\left(x_{t+1} \mid x_{1: t}, \alpha, \beta\right)
$$
* We make the key independence assumption ($α$ is conditionally independent of $β$ given the text) and approximate $\operatorname{Pr_{TPM}}(x_{t+1} \mid x_{1:t}, β)$ by $\operatorname{Pr_{LM}}(x_{t+1} \mid x_{1:t})$:
$$
\begin{aligned}
& \operatorname{Pr}_{\mathrm{TPM}}\left(x_{t+1} \mid x_{1: t}, \alpha, \beta\right) \\
& \quad \propto \operatorname{Pr}_{\mathrm{TPM}}\left(\alpha \mid x_{1: t+1}, \beta\right) \cdot \operatorname{Pr}_{\mathrm{TPM}}\left(x_{t+1} \mid x_{1: t}, \beta\right) \\
& \quad \propto \operatorname{Pr}_{\mathrm{TPM}}\left(\alpha \mid x_{1: t+1}\right) \cdot \operatorname{Pr}_{\mathrm{LM}}\left(x_{t+1} \mid x_{1: t}\right) .
\end{aligned}
$$
* Unsupervised setting
* Assume that the base pre-trained LM is not fine-tuned given task-specific supervision.
* It may still be adapted to generate text in a specific domain or context.
$$
p\left(x_{t+1} \mid x_{1: t}, \alpha\right) \propto \operatorname{Pr}_{\mathrm{TPM}}\left(\alpha \mid x_{1: t+1}\right) \cdot \operatorname{Pr}_{\mathrm{LM}}\left(x_{t+1} \mid x_{1: t}\right) .
$$
* Supervised setting
* Assume that $\operatorname{Pr_{LM}}$ is fine-tuned in a sequence-to-sequence manner.
* We adopt an alternative formulation by viewing $\operatorname{Pr_{TPM}}(x_{t+1} | x_{1:t}, α)$ and $\operatorname{Pr_{LM}}(x_{t+1} | x_{1:t})$ as **classifiers** trained for the same task yet with different biases.
$$
\begin{aligned}
& p\left(x_{t+1} \mid x_{1: t}, \alpha\right) \\
& \quad \propto \operatorname{Pr}_{\mathrm{TPM}}\left(x_{t+1} \mid x_{1: t}, \alpha\right)^w \cdot \operatorname{Pr}_{\mathrm{LM}}\left(x_{t+1} \mid x_{1: t}\right)^{1-w}
\end{aligned}
$$
* To summarize, GeLaTo consists of two major steps:
    * Distillation: we train a TPM on samples drawn from the pre-trained LM via MLE to effectively minimize the KL divergence between $\operatorname{Pr_{LM}}$ and $\operatorname{Pr_{TPM}}$.
    * Probabilistic reasoning: for each step of autoregressive generation, we compute $\operatorname{Pr_{TPM}}(· | α)$ and generate from the conditional next-token distribution $p(x_{t+1} | x_{1:t}, α)$ defined above (see the sketch after this list).
* Two advantages:
* The sentences generated following $p(x_{t+1} |x_{1:t}, α)$ are guaranteed to satisfy the lexical constraint α.
* The TPM training is independent of the lexical constraint α, which is only enforced at inference time.
* No need to re-train the TPM model no matter how α changes.
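A minimal numpy sketch of the probabilistic-reasoning step referenced above, covering both the unsupervised product and the supervised geometric mixture; `pr_tpm_alpha_given_prefix`, `pr_tpm_next_given_alpha`, and `pr_lm_next` are hypothetical callables standing in for the TPM and LM queries.

```python
import numpy as np

def next_token_dist(prefix, vocab_size, pr_tpm_alpha_given_prefix,
                    pr_tpm_next_given_alpha, pr_lm_next, w=None):
    """Sketch of the combined next-token distribution p(x_{t+1} | x_{1:t}, alpha).

    pr_tpm_alpha_given_prefix(prefix, x) -> Pr_TPM(alpha | x_{1:t}, x) for candidate x
    pr_tpm_next_given_alpha(prefix)      -> Pr_TPM(x_{t+1} | x_{1:t}, alpha) over the vocab
    pr_lm_next(prefix)                   -> Pr_LM(x_{t+1} | x_{1:t}) over the vocab
    w: None for the unsupervised formula, a weight in [0, 1] for the supervised one.
    """
    p_lm = pr_lm_next(prefix)                         # shape (vocab_size,)
    if w is None:
        # Unsupervised: p ∝ Pr_TPM(alpha | x_{1:t+1}) * Pr_LM(x_{t+1} | x_{1:t})
        p_alpha = np.array([pr_tpm_alpha_given_prefix(prefix, x)
                            for x in range(vocab_size)])
        scores = p_alpha * p_lm
    else:
        # Supervised: p ∝ Pr_TPM(x_{t+1} | x_{1:t}, alpha)^w * Pr_LM(x_{t+1} | x_{1:t})^(1-w)
        scores = pr_tpm_next_given_alpha(prefix) ** w * p_lm ** (1 - w)
    return scores / scores.sum()                      # renormalize over the vocabulary
```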
# Efficient Probabilistic Reasoning with Hidden Markov Models (HMMs)

* In both settings, we need to compute quantities of the form $\operatorname{Pr_{TPM}}(x_{1:t}, α)$:
* unsupervised setting: $\operatorname{Pr}(α | x_{1:t+1})$ = $\operatorname{Pr}(x_{1:t+1}, α)/\operatorname{Pr}(x_{1:t+1})$
* supervised setting: $\operatorname{Pr}(x_{t+1} | x_{1:t}, α) ∝ \operatorname{Pr}(x_{1:t+1}, α)$
* We describe a **dynamic programming algorithm** that computes $\operatorname{Pr}(x_{1:t}, α)$ for HMMs, where **α** is some lexical constraint encoded in a conjunctive normal form (CNF):
$$
\left(I\left(w_{1,1}\right) \vee \cdots \vee I\left(w_{1, d_1}\right)\right) \wedge \cdots \wedge\left(I\left(w_{m, 1}\right) \vee \cdots \vee I\left(w_{m, d_m}\right)\right)
$$
* $w_{i,j}$ is a string of tokens.
* $I(w_{i,j})$ is the indicator variable that represents whether $w_{i,j}$ appears in the generated text.
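For illustration, a CNF constraint of this form can be represented as a list of clauses, each a list of keystrings (tuples of token ids); the helper below is a sketch of how the indicator $I(w_{i,j})$ is evaluated on generated text, not the paper's code.

```python
# Sketch: a CNF lexical constraint alpha as a list of clauses; each clause is a
# list of keystrings, and each keystring is a tuple of token ids.
# alpha is satisfied iff every clause contains at least one keystring that
# occurs as a contiguous subsequence of the generated token sequence.

def occurs(keystring, tokens):
    """True iff `keystring` appears as a contiguous subsequence of `tokens`."""
    k = len(keystring)
    return any(tuple(tokens[i:i + k]) == keystring
               for i in range(len(tokens) - k + 1))

def satisfies(alpha, tokens):
    """Evaluate the CNF: conjunction over clauses, disjunction inside each clause."""
    return all(any(occurs(w, tokens) for w in clause) for clause in alpha)

# Example with made-up token ids: clause 1 requires (17,) or (17, 42),
# clause 2 requires (101,) or (202,).
alpha = [[(17,), (17, 42)], [(101,), (202,)]]
print(satisfies(alpha, [5, 101, 9, 17, 42]))  # True
```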
## Hidden Markov Models
* The joint probability $\operatorname{Pr}(x_{1:n}, z_{1:n})$ is defined as:
$$
\operatorname{Pr}\left(x_1 \mid z_1\right) \operatorname{Pr}\left(z_1\right) \prod_{2 \leq t \leq n} \operatorname{Pr}\left(x_t \mid z_t\right) \operatorname{Pr}\left(z_t \mid z_{t-1}\right)
$$
* The parameters of HMM are given by the initial probability $\operatorname{Pr}(z_1)$, emission matrix $\operatorname{Pr}(x_t | z_t)$ and the transition matrix $\operatorname{Pr}(z_{t+1} | z_t)$, which stay the same across different positions t.
$$
\operatorname{Pr}\left(x_{t: n} \mid z_t, x_{1: t-1}\right)=\operatorname{Pr}\left(x_{t: n} \mid z_t\right) .
$$
* Forward algorithm (a code sketch follows at the end of this subsection):
$$
\operatorname{Pr}\left(x_{1: t}, z_t\right)=\sum_{1 \leq z_{t-1} \leq h} \operatorname{Pr}\left(x_t \mid z_t\right) \operatorname{Pr}\left(z_t \mid z_{t-1}\right) \operatorname{Pr}\left(x_{1: t-1}, z_{t-1}\right)
$$
* $\operatorname{Pr_{HMM}}(x_{1:n})$ effectively defines a distribution over all texts with length ≤ n.
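A minimal numpy sketch of an HMM with the parameters above: the joint probability of a complete $(x_{1:n}, z_{1:n})$ path and the forward recursion; the dimensions are illustrative.

```python
import numpy as np

h, v = 3, 5                                # illustrative: hidden states, vocab size
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(h))             # Pr(z_1)
A = rng.dirichlet(np.ones(h), size=h)      # A[i, j] = Pr(z_{t+1}=j | z_t=i)
B = rng.dirichlet(np.ones(v), size=h)      # B[i, x] = Pr(x_t=x | z_t=i)

def joint(x, z):
    """Pr(x_{1:n}, z_{1:n}) = Pr(x_1|z_1)Pr(z_1) * prod_t Pr(x_t|z_t)Pr(z_t|z_{t-1})."""
    p = pi[z[0]] * B[z[0], x[0]]
    for t in range(1, len(x)):
        p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
    return p

def forward(x):
    """Forward algorithm: f[t, k] = Pr(x_{1:t+1}, z_{t+1}=k) (0-indexed positions)."""
    f = np.zeros((len(x), h))
    f[0] = pi * B[:, x[0]]
    for t in range(1, len(x)):
        f[t] = B[:, x[t]] * (f[t - 1] @ A)
    return f

x = [0, 2, 1, 4]
print(forward(x)[-1].sum())                # Pr(x_{1:n}), summing out the last state
```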
## An Efficient Dynamic Programming Algorithm

* The dynamic program recursively computes terms of the form $\operatorname{Pr}(x_{l:r}, \psi_{l:n} \mid z_l)$, where:
    * $ψ$ is a CNF formula consisting of a subset of the clauses in $α$; in particular, $α′$ denotes the CNF obtained by removing from the original $α$ the clauses that are already satisfied,
    * $x_{l:r}$ is either the empty string or a suffix of some keystring in $α$,
    * $z_l$ is a latent state for $Z_l$.
* $S(x, α)$ is the set of strings that complete a partially generated keystring:
$$S(x, \alpha):=\left\{s: \exists x^{\prime} \text { a suffix of } x \text { s.t. } x^{\prime} \oplus s \text { lies in } \alpha\right\}$$
* Case 1. $x_{l:r} \neq \emptyset$; then
$$
\begin{aligned}
& \operatorname{Pr}\left(x_{l: r}, \alpha_{l: n} \mid z_l\right) \\
& =\sum_{z_{r+1}} \operatorname{Pr}\left(x_{l: r}, z_{r+1} \mid z_l\right)\left(\operatorname{Pr}\left(\alpha_{r+1: n} \mid z_{r+1}\right)\right. \\
& \quad+\sum_{s \in S\left(x_{l: r}, \alpha\right)} \operatorname{Pr}\left(s_{r+1: r+|s|},\left(\alpha \backslash x_{l: r} \oplus s\right)_{r+1: n} \mid z_{r+1}\right) \\
& \quad\left.-\sum_{s \in S\left(x_{l: r}, \alpha\right)} \operatorname{Pr}\left(s_{r+1: r+|s|}, \alpha_{r+1: n} \mid z_{r+1}\right)\right) ;
\end{aligned}
$$
* Case 2. $x_{l:r} = \emptyset$; we reduce the problem to Case 1 by enumerating $x_l$ over the vocabulary:
$$
\operatorname{Pr}\left(\alpha_{l: n} \mid z_l\right)=\sum_{x_l \in \text { vocabulary }} \operatorname{Pr}\left(x_l, \alpha_{l: n} \mid z_l\right)
$$
* At step $t$, we guide generation by computing $\operatorname{Pr}(x_{1:t−1}, x_t, α_{1:n})$ for each candidate next token $x_t$, where $x_{1:t−1}$ denotes the first $t−1$ tokens that have been generated:
$$
\operatorname{Pr}\left(x_{1: t}, \alpha_{1: n}\right)=\sum_{z_1} \operatorname{Pr}\left(z_1\right) \operatorname{Pr}\left(x_{1: t}, \alpha_{1: n} \mid z_1\right)
$$
* The time complexity of GeLaTo is $\mathcal{O}(2^{|α|} n m)$, where:
    * $|α|$ is the number of clauses in $α$,
    * $n$ is the maximum sequence length,
    * $m$ is the number of different suffixes over all keystrings in $α$.
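The general dynamic program tracks suffixes of keystrings and subsets of clauses; the sketch below is a simplification (mine, not the paper's implementation) for the special case of a single one-token keyword $w$, where $\operatorname{Pr}(x_{1:t}, α_{1:n})$ reduces to a forward pass plus a backward pass over the event that $w$ is never emitted in the remaining positions.

```python
import numpy as np

def pr_prefix_and_keyword(x_prefix, w, n, pi, A, B):
    """Simplified constrained query (sketch, single one-token keyword w):
    Pr(x_{1:t}, alpha_{1:n}) where alpha = "token w appears somewhere in x_{1:n}".

    Uses Pr(x_{1:t}, alpha) = Pr(x_{1:t}) - Pr(x_{1:t}, w absent from x_{1:n}).
    """
    h, t = len(pi), len(x_prefix)

    # Forward pass: f[k] = Pr(x_{1:t}, z_t = k).
    f = pi * B[:, x_prefix[0]]
    for s in range(1, t):
        f = B[:, x_prefix[s]] * (f @ A)
    pr_prefix = f.sum()

    if w in x_prefix:                       # constraint already satisfied by the prefix
        return pr_prefix
    if t == n:                              # no positions left, w never appeared
        return 0.0

    # Backward pass: g[k] = Pr(positions s..n never emit w | z_s = k), s = n down to t+1.
    g = 1.0 - B[:, w]                       # s = n
    for s in range(n - 1, t, -1):
        g = (1.0 - B[:, w]) * (A @ g)
    # Pr(x_{1:t}, w absent from x_{t+1:n}) marginalizes z_t and z_{t+1}.
    pr_absent = f @ (A @ g)
    return pr_prefix - pr_absent
```

The general algorithm additionally caches these quantities across keystring suffixes and clause subsets, which is where the $2^{|α|}$ and $m$ factors in the complexity come from.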
# Experiments
1. Fine-tuning GPT2-large
* domain adaptation
* sequence-to-sequence
2. Training HMMs
* To enforce lexical constraints in autoregressive generation.
3. Constraint Formulation
$$
\begin{aligned}
& {[I(\text { catch }) \vee I(\text { caught }) \vee \ldots] } \\
\wedge & {[I(\text { fr } \oplus \text { is } \oplus \text { bee }) \vee I(\text { fr } \oplus \text { is } \oplus \text { bees }) \vee \ldots] } \\
\wedge & {[I(\text { snow }) \vee I(\text { snow } \oplus \text { ing }) \vee I(\text { snow } \oplus \text { ed }) \vee \ldots] }
\end{aligned}
$$
4. Decoding
* We adopt beam search to greedily search for $x_{1:n}$ that maximizes $p(x_{1:n} | α)$ (see the sketch after this list).
5. Metrics
* ROUGE
* BLEU
* CIDEr
* SPICE
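A minimal sketch of the decoding step (item 4 above): beam search over the combined next-token distribution $p(x_{t+1} \mid x_{1:t}, α)$; `combined_next_dist` is a hypothetical scorer like the one sketched earlier, and `eos_id` marks end of sequence.

```python
import numpy as np

def constrained_beam_search(combined_next_dist, eos_id, beam_size=4, max_len=32):
    """Beam search for x_{1:n} approximately maximizing p(x_{1:n} | alpha).

    combined_next_dist(prefix) -> array of p(x_{t+1} | x_{1:t}, alpha) over the vocab,
    e.g. the GeLaTo-style product of TPM and LM scores.
    """
    beams = [((), 0.0)]                                 # (prefix, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            probs = combined_next_dist(prefix)
            top = np.argsort(probs)[-beam_size:]        # expand the best tokens
            for tok in top:
                cand = (prefix + (int(tok),),
                        logp + float(np.log(probs[tok] + 1e-12)))
                (finished if tok == eos_id else candidates).append(cand)
        if not candidates:
            break
        # keep the highest-scoring partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]
```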

# Conclusion
* We propose GeLaTo, where we use tractable probabilistic models (TPMs) to impose complex lexical constraints (denoted α) in autoregressive language generation from large language models.
* With hidden Markov models (HMMs) as a running example:
* We present an efficient **dynamic programming algorithm** for conditioning HMMs on complex lexical constraints.
* We demonstrate the effectiveness of GeLaTo on various constrained generation benchmarks.
# Appendix
## [Autoregressive model](https://deepchecks.com/glossary/autoregressive-model/)
* An autoregressive language model is a type of Machine Learning model that uses **autoregressive techniques** to predict the next word in a sequence of words based on the words that have come before it.
* $$y(t)=c+w_1 y(t-1)+w_2 y(t-2)+\ldots+w_p y(t-p)+e(t)$$
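A tiny numpy illustration of the AR($p$) recurrence above, with made-up coefficients.

```python
import numpy as np

# Simulate y(t) = c + w_1*y(t-1) + ... + w_p*y(t-p) + e(t) with made-up coefficients.
rng = np.random.default_rng(0)
c, w = 0.5, np.array([0.6, -0.2])                 # p = 2
y = [0.0, 0.0]                                    # initial values y(1), y(2)
for t in range(2, 50):
    e = rng.normal(scale=0.1)                     # noise term e(t)
    y.append(c + w @ np.array(y[-1:-3:-1]) + e)   # w_1*y(t-1) + w_2*y(t-2)
print(y[-5:])
```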
## [HMMs](https://wisdomml.in/hidden-markov-model-hmm-in-nlp-python/)
* A Hidden Markov Model (HMM) is a statistical model used to describe a sequence of observable events or symbols in terms of an underlying sequence of hidden states.
* Given a sequence of observations, the goal of HMMs is to find the most likely sequence of hidden states that generated those observations.
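A minimal numpy sketch of the Viterbi algorithm, which recovers the most likely hidden-state sequence for a given observation sequence; the parameters $(\pi, A, B)$ follow the HMM sketch earlier.

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Most likely hidden-state sequence argmax_z Pr(z_{1:n} | x_{1:n}) for an HMM."""
    n, h = len(x), len(pi)
    delta = np.zeros((n, h))                # best log-prob of any path ending in state k
    back = np.zeros((n, h), dtype=int)      # backpointers
    delta[0] = np.log(pi) + np.log(B[:, x[0]])
    for t in range(1, n):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: transition i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, x[t]])
    # trace back the best path from the best final state
    z = [int(delta[-1].argmax())]
    for t in range(n - 1, 0, -1):
        z.append(int(back[t][z[-1]]))
    return z[::-1]
```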
